Wrangler: interactive visual specification of data transformation scripts

Authors: Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, Jeffrey Heer
Published in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Though data analysis tools continue to improve, analysts still expend an inordinate amount of time and effort manipulating data and assessing data quality issues. Such "data wrangling" regularly involves reformatting data values or layout, correcting erroneous or missing values, and integrating multiple data sources. These transforms are often difficult to specify and difficult to reuse across analysis tasks, teams, and tools. In response, we introduce Wrangler, an interactive system for… 

Research directions in data wrangling: Visualizations and transformations for usable and credible data

It is argued that analysts might more effectively wrangle data through new interactive systems that integrate data verification, transformation, and visualization.

Proactive wrangling: mixed-initiative end-user programming of data transformation scripts

A model to proactively suggest data transforms which map input data to a relational format expected by analysis tools is presented, and a metric that scores tables according to type homogeneity, sparsity and the presence of delimiters is proposed.

Towards Automatic Data Format Transformations: Data Wrangling at Scale

This paper proposes two approaches to identifying candidate data examples and validating the transformations synthesized from them, and evaluates both empirically using datasets from open government data.

DataXFormer: A robust transformation discovery system

This paper presents the full-fledged DataXFormer system, which discovers possible transformations from web tables and web forms and involves human feedback where appropriate, along with algorithms to find transformations that entail multiple columns of input data.

Data-centric disambiguation for data transformation with programming-by-example

A novel approach, data-centric disambiguation for data transformation, in which users resolve the ambiguity in a transformation by examining and modifying the output rather than the program.

UNCHARTIT: An Interactive Framework for Program Recovery from Charts

This paper tackles the problem of recovering data transformations from existing charts: a tool called UNCHARTIT automatically recovers the data transformation program underlying a chart, and it is evaluated on a set of 50 benchmarks from Kaggle.

Wrex: A Unified Programming-by-Example Interaction for Synthesizing Readable Code for Data Scientists

It is suggested that presenting readable code to professional data scientists is an indispensable component of offering data wrangling tools in notebooks.

Natto: Rapid Visual Iteration of Analytic Data Models with Intelligent Assistance

This study investigates how analysts process and transform large sets of XML data to create an analytic data model useful to further their analysis, and implements Natto as a proof-of-concept prototype that actualizes a set of visual and interaction design choices.

Foofah: Transforming Data By Example

This paper develops a technique to synthesize data transformation programs by example, reducing the analyst's burden by letting them describe the transformation with a small input-output example pair, without being concerned with the transformation steps required to get there.

B2: Bridging Code and Interactive Visualization in Computational Notebooks

B2, a set of techniques grounded in treating data queries as a shared representation between code and interactive visualizations, is presented and is found to promote a tighter feedback loop between coding and interacting with visualizations.



Potter's Wheel: An Interactive Data Cleaning System

Potter’s Wheel is an interactive data cleaning system that tightly integrates transformation and discrepancy detection. Users gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.

Interactive Data Integration through Smart Copy & Paste

A novel smart copy and paste (SCP) model and architecture for seamlessly combining the design-time and run-time aspects of data integration, and an initial prototype, the CopyCat system, are proposed.

Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation

D-Dupe is a novel user interface for interactive entity resolution in relational data. It combines relational entity resolution algorithms with a novel network visualization that lets users draw on an entity's relational context when making resolution decisions.

PADS: a domain-specific language for processing ad hoc data

From such descriptions, the PADS compiler generates libraries and tools for manipulating the data, including parsing routines, statistical profiling tools, translation programs to produce well-behaved formats such as XML or those required for loading relational databases, and tools for running XQueries over raw PADS data sources.

AJAX: an extensible data cleaning tool

The AJAX system is presented and applied to two real-world problems: the consolidation of a telecommunication database, and the conversion of a dirty database of bibliographic references into a set of clean, normalized, and redundancy-free relational tables maintaining the same data.

Mining database structure; or, how to build a data quality browser

Techniques for quickly identifying which fields have similar values, identifying join paths, estimating join directions and sizes, and identifying structures in the database are presented.

Intelligently creating and recommending reusable reformatting rules

A user interface is described where a user can describe the formats of each kind of data and an algorithm is provided that uses these formats to automatically generate reformatting rules that transform strings from one format to another.

Potluck: Semi-Ontology Alignment for Casual Users

Potluck is a web user interface (Figure 1) that lets casual users, those without programming skills and data modeling expertise, repurpose heterogeneous Semantic Web data. It lets users merge,

Quantitative Data Cleaning for Large Databases

This work takes a statistical view of data quality, emphasizing intuitive outlier detection and exploratory data analysis methods based on robust statistics. It stresses algorithms that can be implemented easily and efficiently in very large databases and that are easy to understand and visualize graphically.

Interactive Simultaneous Editing of Multiple Text Regions

This work describes a generalization method that is fast (suitable for interactive use), domain-specific (capable of using high-level knowledge such as Java and HTML syntax), and under user control (generalizations can be corrected or overridden).