Corpus ID: 4542967

Detecting Dependencies in Sparse, Multivariate Databases Using Probabilistic Programming and Non-parametric Bayes

Feras A. Saad and Vikash K. Mansinghka
International Conference on Artificial Intelligence and Statistics

Datasets with hundreds of variables and many missing values are commonplace. In this setting, it is both statistically and computationally challenging to detect true predictive relationships between variables and also to suppress false positives. This paper proposes an approach that combines probabilistic programming, information theory, and non-parametric Bayes. It shows how to use Bayesian non-parametric modeling to (i) build an ensemble of joint probability models for all the variables; (ii… 
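
As a rough illustration of the information-theoretic ingredient mentioned in the abstract, the sketch below estimates mutual information by Monte Carlo under a single joint model. It is not the paper's method: the bivariate Gaussian stand-in model, the function names, and the sample size are all assumptions chosen so the estimate can be checked against a known closed form.

```python
import math
import random


def gaussian_mi_exact(rho):
    """Closed-form mutual information (in nats) of a standard bivariate
    Gaussian with correlation rho: I(X;Y) = -0.5 * log(1 - rho^2)."""
    return -0.5 * math.log(1.0 - rho * rho)


def gaussian_mi_monte_carlo(rho, n_samples=20000, seed=0):
    """Monte Carlo estimate of I(X;Y) = E[log p(y|x) - log p(y)].

    Assumption: the joint model is a standard bivariate Gaussian, used here
    only as a hypothetical stand-in for a learned joint probability model.
    """
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho * rho)  # conditional std of Y given X
    total = 0.0
    for _ in range(n_samples):
        # Draw (x, y) from the joint model.
        x = rng.gauss(0.0, 1.0)
        y = rho * x + s * rng.gauss(0.0, 1.0)
        # Log conditional density log p(y | x) = log N(y; rho * x, s^2).
        log_cond = (-0.5 * math.log(2.0 * math.pi * s * s)
                    - (y - rho * x) ** 2 / (2.0 * s * s))
        # Log marginal density log p(y) = log N(y; 0, 1).
        log_marg = -0.5 * math.log(2.0 * math.pi) - y * y / 2.0
        total += log_cond - log_marg
    return total / n_samples


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.8):
        print(rho, gaussian_mi_monte_carlo(rho), gaussian_mi_exact(rho))
```

With an ensemble of models, one plausible use of such an estimator would be to compute it once per model and inspect the spread of the resulting values, though how the paper combines them is not recoverable from the truncated abstract.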

Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes

Human evaluators often prefer the results from probabilistic search to those from a standard baseline; the result is a flexible search technique, integrated into BayesDB, that applies to a broad class of information retrieval problems.

Bayesian synthesis of probabilistic programs for automatic data modeling

Experimental results show that the techniques presented can accurately infer qualitative structure in multiple real-world data sets and outperform standard data analysis methods in forecasting and predicting new data.

Bayesian Kernelised Test of (In)dependence with Mixed-type Variables

A Bayesian kernelised correlation test of (in)dependence using a Dirichlet process model is proposed and the properties of the approach are theoretically shown, as well as algorithms for fast computation with it.

Artificial intelligence-assisted data analysis with BayesDB

Experiments show that CrossCat, the default model discovery mechanism used by BayesDB, can address all three problems in data analysis effectively, including modeling patterns of missing data, imputing missing values in datasets, and characterizing the error behavior of predictive models.

Hierarchical Infinite Relational Model

The HIRM generalizes the standard infinite relational model and can be used for a variety of data analysis tasks including dependence detection, clustering, and density estimation and is used to discover relational structure in real-world datasets from politics and genomics.

SPPL: probabilistic programming with fast exact symbolic inference

SPPL translates probabilistic programs into sum-product expressions, a new symbolic representation and associated semantic domain that extends standard sum-product networks to support mixed-type distributions, numeric transformations, logical formulas, and pointwise and set-valued constraints.

A Bayesian nonparametric test for conditional independence

A Bayesian nonparametric method for quantifying the relative evidence in a dataset in favour of the dependence or independence of two variables conditional on a third using Polya tree priors.

Temporally-Reweighted Chinese Restaurant Process Mixtures for Clustering, Imputing, and Forecasting Multivariate Time Series

A Bayesian nonparametric method for forecasting, imputation, and clustering in sparsely observed, multivariate time series data is proposed, demonstrating superior forecasting accuracy and competitive imputation accuracy as compared to multiple widely used baselines.

Human Factors in Model Interpretability: Industry Practices, Challenges, and Needs

The characterization of interpretability work that emerges from the analysis suggests that model interpretability frequently involves cooperation and mental model comparison between people in different roles, often aimed at building trust not only between people and models but also between people within the organization.

Improving Usability, Safety and Patient Outcomes with Health Information Technology - From Research to Practice, Information Technology and Communications in Health Conference, ITCH 2019, Victoria, BC, Canada, 14-17 February 2019

Currently available opioid apps for the major operating systems are reviewed, examining the number of released apps, service providers, operating systems, target user groups, purpose of app, range of features, location, use of evidence, interface, languages, cost and licensing model, and user ratings.



Probabilistic Data Analysis with Probabilistic Programming

Composable generative population models (CGPMs), a computational abstraction that extends directed graphical models and can be used to describe and compose a broad class of probabilistic data analysis techniques, are introduced.

Context-Specific Independence in Bayesian Networks

This paper proposes a formal notion of context-specific independence (CSI), based on regularities in the conditional probability tables (CPTs) at a node, and proposes a technique, analogous to (and based on) d-separation, for determining when such independence holds in a given network.

A Bayesian nonparametric approach to testing for dependence between random variables

A Bayesian nonparametric procedure that leads to a tractable, explicit, and analytic quantification of the relative evidence for dependence vs. independence, using Polya tree priors on the space of probability measures, embedded within a decision-theoretic test for dependence.

Nonparametric Bayes inference on conditional independence

An encompassing nonparametric Bayes model is relied on for the joint distribution of Y, X and Z, with conditional mutual information used as a summary of the strength of conditional dependence, and an asymptotic theory supporting the approach is provided.

Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams)

A new, simple, and efficient "Bayes-ball" algorithm is presented which determines irrelevant sets and requisite information more efficiently than existing methods, and is linear in the size of the graph for belief networks and influence diagrams.

Estimating mutual information.

Two classes of improved estimators for mutual information M(X,Y), from samples of random points distributed according to some joint probability density mu(x,y), based on entropy estimates from k-nearest neighbor distances, are presented.

Dirichlet Process Gaussian Mixture Models: Choice of the Base Distribution

The primary goal of this paper is to compare the choice of conjugate and non-conjugate base distributions on a particular class of DPM models which is widely used in applications, the Dirichlet process Gaussian mixture model (DPGMM).

Scaling Nonparametric Bayesian Inference via Subsample-Annealing

Improved inference on million-row subsamples of US Census data and network log data and a 307-row hospital rating dataset is demonstrated, using a Pitman-Yor generalization of the Cross Categorization model.

Kernel-based Conditional Independence Test and Application in Causal Discovery

A Kernel-based Conditional Independence test (KCI-test) is proposed, by constructing an appropriate test statistic and deriving its asymptotic distribution under the null hypothesis of conditional independence.