• Corpus ID: 232104901

Enhancing the Interactivity of Dataframe Queries by Leveraging Think Time

Doris Xin, Devin Petersohn, Dixin Tang, Yifan Wu, Joseph Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, Aditya G. Parameswaran
IEEE Data Eng. Bull.
We propose opportunistic evaluation, a framework for accelerating interactions with dataframes. Interactive latency is critical for the iterative, human-in-the-loop dataframe workloads that support exploratory data analysis. Opportunistic evaluation significantly reduces interactive latency by 1) prioritizing computation directly relevant to the interactions and 2) leveraging think time for asynchronous background computation for non-critical operators that might be relevant to future…
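
As a rough illustration of the two ideas in the abstract, the sketch below returns the result the user asked for immediately and lets a non-critical operator run in a background thread during simulated think time. It is not the paper's implementation: `OpportunisticCell`, `slow_sum`, and `run_demo` are hypothetical names, and plain Python lists stand in for dataframes.

```python
from concurrent.futures import Future, ThreadPoolExecutor
import time


class OpportunisticCell:
    """Hypothetical helper: runs a non-critical operator in the background."""

    def __init__(self, executor: ThreadPoolExecutor, fn, *args):
        # Submit immediately so the work overlaps with the user's think time.
        self._future: Future = executor.submit(fn, *args)

    def result(self):
        # Blocks only if the background work has not finished yet.
        return self._future.result()


def slow_sum(values):
    # Stand-in for an expensive dataframe operator (assumption for the demo).
    time.sleep(0.05)
    return sum(values)


def run_demo():
    data = [1, 2, 3]
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Non-critical operator: scheduled now, needed only later.
        cell = OpportunisticCell(pool, slow_sum, data)
        # Critical path: the small preview the user asked to see right away.
        preview = data[:2]
        # Simulated think time while the user reads the preview.
        time.sleep(0.1)
        # By the time the user asks, the background result is usually ready.
        return preview, cell.result()
```

Calling `run_demo()` returns the preview together with the background result; in a real system the scheduling decisions would depend on which operators the interaction actually needs.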


Lux: Always-on Visualization Recommendations for Exploratory Dataframe Workflows
Lux, an always-on framework for accelerating visual insight discovery in dataframe workflows, is proposed, which features a high-level language for generating visualizations on demand to encourage rapid visual experimentation with data.
Flexible Rule-Based Decomposition and Metadata Independence in Modin: A Parallel Dataframe System
Modin translates pandas functions into a core set of operators that are individually parallelized via columnar, row-wise, or cell-wise decomposition rules that are formalized in this paper and introduces metadata independence to allow metadata to be decoupled from the physical representation and maintained lazily.


Accelerating Complex Analytics using Speculation
This work proposes a new query processing paradigm that accelerates inter-dependent queries using speculation, enables fast and accurate predictions through approximate query processing (AQP), and efficiently validates speculations through a new streaming join operator.
AFrame: Extending DataFrames for Large-Scale Modern Data Analysis
The architecture of AFrame is presented, the underlying capabilities of AsterixDB that efficiently support modern data analytic operations are described, and an extensible micro-benchmark is introduced for use in evaluating DataFrame performance in both single-node and distributed settings via a collection of representative analytic operations.
Towards scalable dataframe systems
This paper reports on the experience building Modin, a scaled-up implementation of the most widely-used and complex dataframe API today, Python's pandas, and proposes a simple data model and algebra for dataframes to ground discussion in the field.
Magpie: Python at Speed and Scale using Cloud Backends
A system is described, coined Magpie, which exposes the popular Pandas API while lazily pushing large chunks of computation into scalable, efficient, and secured database engines, bringing together the ease of use and versatility of Python environments with the enterprise-grade, high-performance query processing of cloud database systems.
Fine-Grained Lineage for Safer Notebook Interactions
NBSafety is presented, a custom Jupyter kernel that uses runtime tracing and static analysis to automatically manage lineage associated with cell execution and global notebook state and prevents errors that users make during unaided notebook interactions, all while preserving the flexibility of existing notebook semantics.
Putting Pandas in a Box
This work presents the approach to push down the computational part of Pandas scripts into the DBMS by using a transpiler, and shows the usage of this feature to implement a so-called model join, i.e. applying pre-trained ML models to data in SQL tables.
Jupyter Notebooks - a publishing format for reproducible computational workflows
Jupyter notebooks, a document format for publishing code, results and explanations in a form that is both readable and executable, are presented.
Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks
This work crawled over 4M Jupyter notebooks on GitHub, and replayed them step-by-step, to observe not only full input/output tables at each step, but also the exact data-preparation choices data scientists make that they believe are best suited to the input data.
Engineering a Compiler
Koalas: pandas API on Apache Spark
  • 2020