The Split-Apply-Combine Strategy for Data Analysis

@article{Wickham2011TheSS,
  title={The Split-Apply-Combine Strategy for Data Analysis},
  author={H. Wickham},
  journal={Journal of Statistical Software},
  year={2011},
  volume={40},
  pages={1--29}
}
  • H. Wickham
  • Published 2011
  • Computer Science
  • Journal of Statistical Software
Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This insight gives rise to a new R package that allows you to smoothly apply this strategy, without having to worry about the type of structure in which your data is stored. The paper includes two case studies showing how these insights make it easier to work with batting…
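The strategy described in the abstract can be sketched in a few lines. The following is a minimal Python illustration, not the paper's R package or its API: the function name `split_apply_combine` is hypothetical, and the records are made-up toy values used only to show the three steps.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(records, key, func):
    """Group records by key(record), apply func to each group, and
    return the per-group results in a single dict."""
    groups = defaultdict(list)
    for record in records:
        # Split: route each record to the piece it belongs to.
        groups[key(record)].append(record)
    # Apply func to each piece independently, then combine the results.
    return {k: func(piece) for k, piece in groups.items()}


# Toy, made-up records (hypothetical players and values, illustration only).
records = [
    {"player": "a", "year": 2001, "hits": 10},
    {"player": "a", "year": 2002, "hits": 20},
    {"player": "b", "year": 2001, "hits": 6},
]

mean_hits = split_apply_combine(
    records,
    key=lambda r: r["player"],
    func=lambda piece: mean(r["hits"] for r in piece),
)
print(mean_hits)
```

Because the grouping key and the per-piece function are both passed in as arguments, the same helper covers other summaries (counts, maxima, fitted models) without changing the split or combine steps.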
Towards Exploratory Data Analysis for Pharo
TLDR
The DataFrame and DataSeries collections are introduced; they are specifically designed for working with structured data and can be used for descriptive statistics and Exploratory Data Analysis (EDA), the critical first step of data analysis.
SAS and R: Data Management, Statistical Analysis, and Graphics
TLDR
SAS and R: Data Management, Statistical Analysis, and Graphics presents an easy way to learn how to perform an analytical task in both SAS and R, without having to navigate through the extensive, idiosyncratic, and sometimes unwieldy software documentation.
Large-Scale Parallel Statistical Forecasting Computations in R
TLDR
This work generates simulation-based uncertainty bands, which necessitates a large number of computationally intensive realizations, and applies this approach to a forecasting application that fits a variety of models, prohibiting an analytical description of the statistical uncertainty associated with the overall forecast.
Importing z-Tree data into R
TLDR
The purpose of the R package zTree is to make the process of importing data from z-Tree into the statistical package R transparent, reproducible and simple.
Multiple-Table Data in R with the multitable Package
TLDR
The R multitable package is introduced to provide new data storage objects called data.list objects, which extend the data.frame concept to explicitly multiple-table settings and have dimension attributes that make accessing and manipulating them easier.
BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments
TLDR
Two R packages are presented that greatly simplify working in batch computing environments; they use a clear and well-defined interface to the batch system, which makes them applicable in most high-performance computing environments.
Data-Specific Functions: A Comment on Kindel et al.
TLDR
This issue describes a new approach to managing survey data in service of the Fragile Families Challenge, which they call “treating metadata as data,” and recommends that data collection efforts distribute an open-source set of tools for working with a particular data set the author calls data-specific functions.
Industrial Research in Applied Statistics
For almost a year, I sat in Washington D.C.’s National airport every Sunday waiting for my flight to Houston. I was 22 years old, with an undergraduate degree in Mathematics, now working in…
Programming Plus Subject Expertise: A Combined Approach for Approval Profile Modification
TLDR
This article focused on the analysis of four years of circulation data from print monographs acquired through an approval plan and firm ordering. It incorporated interlibrary loan data to compare purchases with demand, which enabled librarians to adjust monograph purchases accordingly.
Parallel computing in linear mixed models
TLDR
The proposed method is used to fit LMM with dense and sparse parameters and for a large number of observations; it is faster than the classical approach and generalizes to big data.

References

Reshaping Data with the reshape Package
TLDR
The reshape package for R is presented, which provides a common framework for many types of data reshaping and aggregation, where the data are ‘melted’ into a form which distinguishes measured and identifying variables, and then cast into a new shape, whether it be a data frame, list, or high dimensional array.
Using APL2 to Create an Object-Oriented Environment for Statistical Computation
TLDR
This work designs an extensible computing environment for data analysis and programming based on APL2 that incorporates some of the features of modern statistical programming languages, such as data objects, and convenient implementation of arrays of any dimension.
Glaciers melt as mountains warm: a graphical case study
TLDR
For the 2006 ASA Data Exposition, graphics were created that tried to find expected features in the data, such as seasonal patterns, spatial correlations, and El Niño events, as well as some more surprising results, several of which were corroborated by stories in the news.
A Primer on the R-Tcl/Tk Package
TLDR
This paper intends to get you started with the R-Tcl/Tk interface, a combination of a scripting language and a toolkit for graphical user interfaces, based on the X11/Unix version of R.
Modern Applied Statistics with S
A guide to using S environments to perform statistical analyses providing both an introduction to the use of S and a course in modern statistical methods. The emphasis is on presenting practical…
abind: Combine Multi-dimensional Arrays. R package
  • 2011
doBy: Groupwise Summary Statistics, General Linear Contrasts
  • 2011
gdata: Various R Programming Tools for Data Manipulation. R package version 2.8.0. With contributions from Ben Bolker
  • 2010