Data Structures for Statistical Computing in Python

  • Wes McKinney
In this paper we are concerned with the practical issues of working with data sets common to finance, statistics, and other related fields. pandas is a new library that aims to facilitate working with these data sets and to provide a set of fundamental building blocks for implementing statistical models. We discuss specific design issues encountered in the course of developing pandas, with relevant examples and some comparisons with the R language. We conclude by discussing possible future…
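The "fundamental building blocks" the abstract refers to can be illustrated with a short sketch: a `Series` carries a label index, arithmetic aligns on that index automatically, and a `DataFrame` groups aligned columns. The prices shown are made-up example values.

```python
import pandas as pd

# A Series pairs values with a label index (here, dates).
prices = pd.Series(
    [10.0, 10.5, 11.0],
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06"]),
)

# Derived quantities stay aligned on the same index.
returns = prices.pct_change()

# A DataFrame groups index-aligned columns into one table.
df = pd.DataFrame({"price": prices, "ret": returns})
print(df)
```

Index alignment is the design choice that distinguishes these structures from plain arrays: operations match rows by label rather than by position.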

pandas: a Foundational Python Library for Data Analysis and Statistics

pandas is discussed, a Python library of rich data structures and tools for working with structured data sets common to statistics, finance, social sciences, and many other fields, which aims to be the foundational layer for the future of statistical computing in Python.

Pingouin: statistics in Python

This presentation explains why Python is far behind the R programming language when it comes to general statistics and why many scientists still rely heavily on R to perform their statistical analyses.

Introduction to Python and Its Statistical Applications

This chapter introduces the history of Python and its IDEs (integrated development environments) and code editors as development environments, and introduces Python libraries that can be used in statistical analysis.

pandera: Statistical Data Validation of Pandas Dataframes

  • N. Bantilan
  • Computer Science
    Proceedings of the 19th Python in Science Conference
  • 2020
Pandera is introduced, an open source package that provides a flexible and expressive data validation API designed to make it easy for data wranglers to define dataframe schemas, so that analysts can spend less time worrying about the correctness of their dataframes and more time obtaining insights and training models.
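The declarative-schema idea described above can be sketched in plain pandas. The `Schema` class here is hypothetical, written only to illustrate the concept; pandera's real API (`DataFrameSchema`, `Column`, `Check`) is considerably richer.

```python
import pandas as pd

class Schema:
    """Hypothetical minimal schema: column name -> row-wise predicate.
    Only illustrates declarative validation; this is NOT pandera's API."""

    def __init__(self, checks):
        self.checks = checks

    def validate(self, df):
        # Fail fast on missing columns or rows violating a predicate.
        for name, check in self.checks.items():
            if name not in df.columns:
                raise ValueError(f"missing column: {name}")
            if not df[name].map(check).all():
                raise ValueError(f"check failed for column: {name}")
        return df

schema = Schema({
    "age": lambda x: x >= 0,
    "name": lambda x: isinstance(x, str),
})

valid = schema.validate(pd.DataFrame({"age": [31, 7], "name": ["a", "b"]}))
```

Defining expectations once at a pipeline boundary, as above, is the workflow pandera supports: validation failures surface immediately instead of as silent downstream errors.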

Python Implementation of the Dynamic Distributed Dimensional Data Model

This Python implementation provides all foundational functionality of D4M and includes Accumulo and SQL database support via Graphulo.jl.

Dask & Numba: Simple libraries for optimizing scientific python code

  • James Crist
  • Computer Science
    2016 IEEE International Conference on Big Data (Big Data)
  • 2016
Numba, a compiler for a subset of the Python language, and Dask, a flexible parallel programming library, are described; together they allow numeric Python code to be optimized incrementally, requiring minimal changes.
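The "incremental, minimal changes" workflow can be sketched as follows: the numeric kernel is ordinary Python over a NumPy array, and applying Numba's `@jit` decorator is the only change needed to compile it. The import guard is an assumption added here so the sketch also runs where Numba is not installed.

```python
import numpy as np

# Incremental optimization: decorate an existing function, change nothing else.
try:
    from numba import jit   # compiles the decorated function to machine code
except ImportError:
    jit = lambda f: f       # fallback: run the plain interpreted version

@jit
def sum_of_squares(xs):
    # A simple numeric loop, the kind of code Numba accelerates well.
    total = 0.0
    for x in xs:
        total += x * x
    return total

result = sum_of_squares(np.array([1.0, 2.0, 3.0]))  # 14.0
```

Dask complements this by splitting large arrays or dataframes into chunks and scheduling per-chunk work in parallel, again behind interfaces that mirror NumPy and pandas.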

SparkR: Scaling R Programs with Spark

SparkR is presented, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell.

ds-array: A Distributed Data Structure for Large Scale Machine Learning

This paper proposes a novel distributed data structure for dislib, called ds-array, that addresses dislib’s main limitations in data management and results in performance improvements of up to two orders of magnitude over Datasets, while also greatly improving scalability and usability.

nbodykit: A Python Toolkit for Cosmology Simulations and Data Analysis on Parallel HPC Systems

This work takes advantage of the readability of Python as an interpreted language by implementing nbodykit in pure Python, while ensuring high performance by relying on external, compiled libraries, optimized for specific tasks.

Python in Data Science Research and Education

It is demonstrated how Python can be used throughout the entire life cycle of a graduate program in Data Science, starting from introductory classes and culminating in degree capstone research projects using more advanced ideas such as convex optimization, non-linear dimension reduction, and compressed sensing.


