Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing

@article{Matejka2017SameSD,
  title={Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing},
  author={Justin Matejka and George W. Fitzmaurice},
  journal={Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems},
  year={2017}
}
  • Justin Matejka, G. Fitzmaurice
  • Published 2 May 2017
  • Computer Science
  • Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems
Datasets which are identical over a number of statistical properties, yet produce dissimilar graphs, are frequently used to illustrate the importance of graphical representations when exploring data. […] Our technique varies from previous approaches in that new datasets are iteratively generated from a seed dataset through random perturbations of individual data points, and can be directed towards a desired outcome through a simulated annealing optimization strategy. Our method has the benefit of…
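The loop described in the abstract (perturb one point of a seed dataset, keep the move only if the summary statistics are unchanged, and steer towards a target with simulated annealing) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The choice of statistics (means, standard deviations, Pearson correlation), the tolerance `tol`, the linear cooling schedule, the circular target shape, and every function and parameter name are assumptions made for the example.

```python
import math
import random
import statistics


def summary(points):
    """Return (mean_x, mean_y, sd_x, sd_y, pearson_r) for a list of (x, y) pairs."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in points) / (len(points) - 1)
    return (mx, my, sx, sy, cov / (sx * sy))


def circle_distance(point, cx=0.0, cy=0.0, r=1.4):
    """Distance from a point to a hypothetical target shape: a circle of radius r."""
    return abs(math.hypot(point[0] - cx, point[1] - cy) - r)


def anneal(seed, iterations=20000, step=0.05, t_start=0.4, t_end=0.01,
           tol=0.01, rng=None):
    """Perturb one point at a time while keeping the summary statistics near the seed's."""
    rng = rng or random.Random(0)
    target = summary(seed)
    current = list(seed)
    for i in range(iterations):
        # Assumed linear cooling from t_start down to t_end.
        temp = t_start + (t_end - t_start) * i / iterations
        idx = rng.randrange(len(current))
        old = current[idx]
        new = (old[0] + rng.gauss(0, step), old[1] + rng.gauss(0, step))
        # Accept moves towards the target shape, or worse moves with probability temp.
        improves = circle_distance(new) < circle_distance(old)
        if not (improves or rng.random() < temp):
            continue
        current[idx] = new
        # Revert the move if any statistic drifts more than tol from the seed's value.
        if not all(abs(u - v) <= tol for u, v in zip(summary(current), target)):
            current[idx] = old
    return current


# Usage: start from a random Gaussian cloud and pull it towards the circle.
data_rng = random.Random(1)
seed = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(60)]
result = anneal(seed, iterations=5000)
```

Because every accepted move is checked against the seed's statistics, the output stays within `tol` of the seed on all five values however far the points move. Accepting some worse moves while the temperature is high is what lets the search escape local minima.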


Same Stats, Different Graphs
TLDR
This work examines all low-order (≤10) non-isomorphic graphs and provides a simple visual analytics system to explore correlations across multiple graph properties, and describes a method for generating many graphs that are identical over a number of graph properties and statistics yet are clearly different and identifiably distinct.
Same Stats, Different Graphs: Exploring the Space of Graphs in Terms of Graph Properties
TLDR
This work examines low-order non-isomorphic graphs and provides a simple visual analytics system to explore correlations across multiple graph properties, and describes a method for generating many graphs that are identical over a number of graph properties and statistics yet are clearly different and identifiably distinct.
Same Stats, Different Graphs (Graph Statistics and Why We Need Graph Drawings)
TLDR
This work examines all low-order (≤10) non-isomorphic graphs and provides a simple visual analytics system to explore correlations across multiple graph properties, and describes a method for generating many graphs that are identical over a number of graph properties and statistics yet are clearly different and identifiably distinct.
Graphs in phylogenetic comparative analysis: Anscombe's quartet revisited
TLDR
The intent of this article is to help build the general case that phylogenetic comparative methods are statistical methods and consequently that graphing or visualization should invariably be included as an essential step in the authors' standard data analytical pipelines.
Comparing two samples through stochastic dominance: a graphical approach
TLDR
This paper introduces a dominance measure for two random variables that quantifies the proportion in which the cumulative distribution function of one of the random variables stochastically dominates the other one, and presents a graphical method that decomposes the proposed dominance measure into quantiles.
Evolutionary dataset optimisation: learning algorithm quality through evolution
TLDR
A number of known properties about preferable datasets for the clustering algorithms known as k-means and DBSCAN are realised in the generated datasets.
Real-Time Exploration of Large Spatiotemporal Datasets Based on Order Statistics
TLDR
The Quantile Datacube Structure (QDS) is introduced that bridges this gap by supporting interactive visual exploration based on order statistics by making use of an efficient non-parametric distribution approximation scheme called p-digest and employs a novel datacube indexing scheme that reduces the memory usage of previous datacubes.
Statistical significance calculations for scenarios in visual inference
TLDR
A new approach for computing statistical significance associated with the results from applying a lineup protocol that utilizes a Dirichlet distribution to accommodate different levels of visual interest in individual null panels.
v‐plots: Designing Hybrid Charts for the Comparative Analysis of Data Distributions
TLDR
The v‐plot designer is presented, a technique for authoring custom hybrid charts that combine mirrored bar charts, difference encodings, and violin‐style plots; v‐plots are customizable and enable the simultaneous comparison of data distributions on global, local, and aggregation levels.
Integrated Development Environment with Interactive Scatter Plot for Examining Statistical Modeling
TLDR
This paper proposes combining a code editor with an interactive scatter plot editor to efficiently understand the behavior of statistical modeling algorithms.

References

Generating Data with Identical Statistics but Dissimilar Graphics
TLDR
This article provides a general procedure to generate datasets with identical summary statistics but dissimilar graphics by applying a genetic-algorithm-based approach to the Anscombe dataset.
CLONING DATA: GENERATING DATASETS WITH EXACTLY THE SAME MULTIPLE LINEAR REGRESSION FIT
TLDR
A simple computational procedure for generating ‘matching’ or ‘cloning’ datasets so that they have exactly the same fitted multiple linear regression equation, suggesting that ‘same fit’ procedures may provide a general and useful alternative to model‐based procedures, and have a wide range of applications.
Interactive Random Graph Generation with Evolutionary Algorithms
TLDR
The graph generation process from a user's perspective is described, details about the evolutionary algorithm are provided, and how GraphCuisine is employed to generate graphs that mimic a given real-world network are demonstrated.
Graphical inference for infovis
TLDR
The "Rorschach" helps the analyst calibrate their understanding of uncertainty and "line-up" provides a protocol for assessing the significance of visual discoveries, protecting against the discovery of spurious structure.
Illustration of regression towards the means
This article presents a procedure for generating a sequence of data sets which will yield exactly the same fitted simple linear regression equation y = a + bx. Unless rescaled, the generated data
Graphs in Statistical Analysis
TLDR
In this chapter, the terms and expressions commonly used in medical statistics are defined and other issues encountered in statistics, including interaction, confounding, jack-knifing and co-linearity, are described.
Residual (Sur)Realism
We show how to construct multiple linear regression datasets with the property that the plot of residuals versus predicted values from the least squares fit of the correct model reveals a hidden
Privacy-preserving data publishing: A survey of recent developments
TLDR
This survey will systematically summarize and evaluate different approaches to PPDP, study the challenges in practical data publishing, clarify the differences and requirements that distinguish PPDP from other related problems, and propose future research directions.
Simulated Annealing: Theory and Applications
TLDR
Performance of the simulated annealing algorithm and the relation with statistical physics and asymptotic convergence results are presented.
On Simpson's Paradox and the Sure-Thing Principle
This paradox is the possibility of P(A|B) < P(A|B') even though P(A|B) ≥ P(A|B') both under the additional condition C and under the complement C' of that condition. Details are given on why