• Corpus ID: 235489955

Scalable Econometrics on Big Data -- The Logistic Regression on Spark

Aurelien Ouattara, Matthieu Bulté, Wan-Ju Lin, Philipp Scholl, Benedikt Veit, Christos Ziakas, Florian Felice, Julien Virlogeux, George N. Dikos
Extra-large datasets are becoming increasingly accessible, and computing tools designed to handle huge amounts of data efficiently are rapidly becoming widely available. However, conventional statistical and econometric tools still struggle with such large datasets. This paper examines econometrics on big datasets, focusing specifically on logistic regression on Spark. We review the robustness of the functions available in Spark to fit logistic regression and introduce a…

Econometrics at Scale: Spark Up Big Data in Economics
This paper explains how to use Spark to explore big data sets that exceed the memory of retail-grade computers and to run typical econometric tasks, including microeconometric, panel data, and time series regression models, that are prohibitively expensive to evaluate on stand-alone machines.
Big Data: New Tricks for Econometrics
A few tools for manipulating and analyzing big data such as decision trees, support vector machines, neural nets, deep learning, and so on may allow for more effective ways to model complex relationships.
The Data Revolution and Economic Analysis
Many believe that “big data” will transform business, government, and other aspects of the economy. In this article we discuss how new data may impact economic policy and economic research.
Spark: The Definitive Guide: Big Data Processing Made Simple
Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark's scalable machine-learning library.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
In the words of the authors, the goal of this book was to “bring together many of the important new ideas in learning, and explain them in a statistical framework.” The authors have been quite
Research Commentary - Too Big to Fail: Large Samples and the p-Value Problem
This research commentary recommends a series of actions the researcher can take to mitigate the p-value problem in large samples and illustrates them with an example of over 300,000 camera sales on eBay.
Spark: Cluster Computing with Working Sets
Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Post-Selection Inference for Generalized Linear Models With Many Controls
This article considers generalized linear models in the presence of many controls. We lay out a general methodology to estimate an effect of interest based on the construction of an instrument that
Square-Root Lasso: Pivotal Recovery of Sparse Signals via Conic Programming
We propose a pivotal method for estimating high-dimensional sparse linear regression models, where the overall number of regressors p is large, possibly much larger than n, but only s regressors are
Econometric Analysis of Cross Section and Panel Data