Generating Data with Identical Statistics but Dissimilar Graphics

  title={Generating Data with Identical Statistics but Dissimilar Graphics},
  author={Sangit Chatterjee and Aykut Firat},
  journal={The American Statistician},
  pages={248 - 254}
The Anscombe dataset is popular for teaching the importance of graphics in data analysis. It consists of four datasets that have identical summary statistics (e.g., mean, standard deviation, and correlation) but dissimilar data graphics (scatterplots). In this article, we provide a general procedure to generate datasets with identical summary statistics but dissimilar graphics by using a genetic-algorithm-based approach.
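As a rough illustration of the idea of "identical statistics, dissimilar graphics", the sketch below forces several differently shaped point clouds to share the same sample means, standard deviations, and Pearson correlation. It is not the genetic algorithm the article describes: it swaps in a direct linear-algebra construction (standardize, then one Gram-Schmidt orthogonalization step). The function name `match_stats`, the three input shapes, and the target values are hypothetical choices made only for this example.

```python
import numpy as np

def match_stats(x, y, mean_x, mean_y, sd_x, sd_y, r):
    """Affinely remap (x, y) so that the sample means, sample standard
    deviations (ddof=1), and Pearson correlation equal the targets.
    Assumes y is not an exact linear function of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize x to mean 0 and sample sd 1.
    u = x - x.mean()
    u /= u.std(ddof=1)
    # Remove the component of y along u (one Gram-Schmidt step),
    # then standardize the remainder.
    v = y - y.mean()
    v -= (v @ u) / (u @ u) * u
    v /= v.std(ddof=1)
    # u and v are now uncorrelated with unit sd, so this mix has
    # unit sd and correlation r with u.
    new_x = mean_x + sd_x * u
    new_y = mean_y + sd_y * (r * u + np.sqrt(1.0 - r**2) * v)
    return new_x, new_y

# Three visually different starting shapes (illustrative only).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
line = np.linspace(-1.0, 1.0, 50)
shapes = {
    "noise": (rng.normal(size=50), rng.normal(size=50)),
    "parabola": (line, line**2),
    "circle": (np.cos(t), np.sin(t)),
}

# Illustrative target statistics shared by every output dataset.
targets = dict(mean_x=9.0, mean_y=7.5, sd_x=3.32, sd_y=2.03, r=0.816)
datasets = {name: match_stats(x, y, **targets) for name, (x, y) in shapes.items()}

for name, (nx, ny) in datasets.items():
    print(name, round(nx.mean(), 3), round(ny.mean(), 3),
          round(np.corrcoef(nx, ny)[0, 1], 3))
```

Plotting each pair in `datasets` as a scatterplot gives three clearly different pictures with the same five summary statistics. A search-based method such as the article's genetic algorithm is more flexible, since it can in principle target other statistics or steer the graphical appearance, which this closed-form construction does not attempt.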
Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing
This paper presents a novel method for generating datasets that are identical over a number of statistical properties yet produce dissimilar graphs, and that allows control over the graphical appearance of the resulting output.
A simple computational procedure for generating ‘matching’ or ‘cloning’ datasets so that they have exactly the same fitted multiple linear regression equation, suggesting that ‘same fit’ procedures may provide a general and useful alternative to model‐based procedures, and have a wide range of applications.
In 1973, Francis Anscombe (Anscombe, 1973) published a fascinating simulated dataset containing four pairs of variables. On calculating the sample correlation coefficient or fitting a least squares
Adding a dimension to Anscombe's quartet: Open source, 3-D data visualization
The development and research goal of this work is to develop an accessible 3-D data tool that allows for a high level of control by the user, and to facilitate future studies on the effectiveness and best-use practices associated with 3-D visualization.
Cloning data with unchanged estimates of estimable non-linear functions of parameters
Cloned datasets for bivariate and multivariate non-linear regression models with the same non-linear regression fit are presented and used to preserve the confidentiality of sensitive data for publication purposes.
Cloning data with unchanged estimates of estimable non-linear functions of parameters
Non-linear regression models occur in the fields of biology, banking, economics, and sociology, population and biological growth. The absolute growth, growth of humans, and most importantly, an
Clustered Iconography: A Resurrected Method for Representing Multidimensional Data
Development of graphical methods for representing data has not kept up with progress in statistical techniques, so a brief history of graphical representations of research findings is presented.
Comparing two samples through stochastic dominance: a graphical approach
This paper introduces a dominance measure for two random variables that quantifies the proportion in which the cumulative distribution function of one of the random variables stochastically dominates the other, and presents a graphical method that decomposes the proposed dominance measure into quantiles.
Significance of Patterns in Data Visualisations
It is shown that the significance of patterns can also be evaluated during exploratory analysis, and that the knowledge of the analyst can be leveraged to improve statistical power by reducing the number of simultaneous comparisons.


Genetic Algorithms in Search Optimization and Machine Learning
This book brings together the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields.
Detection of Influential Observation in Linear Regression
  • R. Cook
  • Mathematics
  • 2000
A new measure based on confidence ellipsoids is developed for judging the contribution of each data point to the determination of the least squares estimate of the parameter vector in full rank
A simple test for heteroscedasticity and random coefficient variation (Econometrica vol 47
A simple test for heteroscedastic disturbances in a linear regression model is developed using the framework of the Lagrangian multiplier test. For a wide range of heteroscedastic and random
Gram‐Schmidt Orthogonalization