Corpus ID: 15474722

Learning from Large-Scale Distributed Health Data: An Approximate Logistic Regression Approach

Che Ngufor, Janusz Wojtusiak
Research in healthcare increasingly depends on access to and analysis of large distributed datasets. Given the exponential rate at which data is being generated, parallel processing is clearly needed if knowledge is to be extracted efficiently from these large datasets. The Hadoop MapReduce framework has evolved into a popular platform for parallelization in many fields, including healthcare. Unfortunately, implementing iterative machine learning algorithms on Hadoop is…


Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing
Three scaling techniques enabling machine learning algorithms to learn from large distributed data sets are described, including a general single-pass formula for computing the covariance matrix of large data sets using the MapReduce framework and two new efficient and accurate sampling schemes.
Optimal Integration of Machine Learning Models: A Large-Scale Distributed Learning Framework with Application to Systematic Prediction of Adverse Drug Reactions
This thesis explores three major challenges in this research area: development of techniques that scale up to large and possibly physically distributed databases, construction of exact or approximately exact global models from distributed heterogeneous datasets with minimal data communication while preserving privacy of the data, and efficient learning from modern large-scale datasets.
Big data survey in healthcare and a proposal for intelligent data diagnosis framework
This research proposes a framework for diagnosing healthcare data to support efficient data analysis, motivated by the lack of a phased system that helps with decision-making in big data analysis.
A Systematic Review of Healthcare Big Data
The present study aims to determine the extent of healthcare big data analytics, together with its applications and the challenges to its adoption in healthcare, by evaluating 34 journal articles (published between 2015 and 2019) against defined inclusion-exclusion criteria.
Multi-task learning with selective cross-task transfer for predicting bleeding and other important patient outcomes
Results for predicting bleeding and need for blood transfusion for patients undergoing non-cardiac operations from an institutional transfusion datamart show that the proposed methods can improve prediction accuracy over standard single-task learning methods.
Predicting Breast Cancer via Supervised Machine Learning Methods on Class Imbalanced Data
This study applies three class-balancing techniques, namely oversampling, undersampling, and a hybrid method, to the Breast Cancer Surveillance Consortium (BCSC) dataset before constructing the supervised learning models.


Logistic Regression Parameter Estimation Based on Parallel Matrix Computation
By applying a new model based on parallel matrix computing methods, the computational bottleneck of the logistic regression algorithm was overcome, and experimental results showed that the new computing model can achieve nearly linear speedup.
Map-Reduce for Machine Learning on Multicore
This work shows that algorithms that fit the Statistical Query model can be written in a certain "summation form," which allows them to be easily parallelized on multicore computers and shows basically linear speedup with an increasing number of processors.
Optimizing Multiple Machine Learning Jobs on MapReduce
An execution cost model was developed to predict the total execution time of jobs; the optimal assignment, obtained by minimizing the cost model, reduced execution time by up to 77% compared to the worst assignment.
MapReduce: Simplified Data Processing on Large Clusters
This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Monte Carlo Linear System Solver using MapReduce
A Monte Carlo based linear system solver is adapted to the MapReduce model, and its parallel efficiency and scalability are compared to the CG implementation; the algorithm performs better than the Hadoop CG implementation but loses to Twister, an alternative MapReduce implementation.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
In the words of the authors, the goal of this book was to “bring together many of the important new ideas in learning, and explain them in a statistical framework.” The authors have been quite…
Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models
This work examines three common distributed training methods for conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efficient technique, and presents a theoretical analysis of conditional maximum entropy models.
Bundle Methods for Regularized Risk Minimization
The theory and implementation of a scalable and modular convex solver are described; the solver handles all these estimation problems, can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties.
Parallelized Stochastic Gradient Descent
This paper presents the first parallel stochastic gradient descent algorithm, including a detailed analysis and experimental evidence, and introduces a novel proof technique, contractive mappings, to quantify the speed of convergence of parameter distributions to their asymptotic limits.
Controlling false match rates in record linkage using extreme value theory
This paper presents a new approach for estimating the false match rate within the framework of Fellegi and Sunter using methods of Extreme Value Theory (EVT); the approach needs no training data to determine the threshold for matches and therefore leads to a significant cost reduction.