F1 Query: Declarative Querying at Scale

@article{Samwel2018F1QD,
  title={F1 Query: Declarative Querying at Scale},
  author={Bart Samwel and John Cieslewicz and Ben Handy and Jason Govig and Petros Venetis and Chanjun Yang and Keith Peters and Jeff Shute and Daniel Tenedorio and Himani Apte and Felix Weigel and David Wilhite and Jiacheng Yang and Jun Xu and Jiexing Li and Zhan Yuan and Craig Chasseur and Qiang Zeng and Ian Rae and Anurag Biyani and Andrew Harn and Yang Xia and Andrey Gubichev and Amr El-Helw and Orri Erling and Zhepeng Yan and Mohan Yang and Yiqun Wei and Thanh Do and Colin Zheng and Goetz Graefe and Somayeh Sardashti and Ahmed M. Aly and Divyakant Agrawal and Ashish Kumar Gupta and Shivakumar Venkataraman},
  journal={Proc. VLDB Endow.},
  year={2018},
  volume={11},
  pages={1835--1848}
}
F1 Query is a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems at Google (e.g., Bigtable, Spanner, Google Spreadsheets, etc.). F1 Query eliminates the need to maintain the traditional distinction between different types of data processing workloads by simultaneously supporting: (i) OLTP-style point queries that affect only a few records; (ii) low-latency OLAP querying of large… 
Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google
TLDR
The authors developed and deployed in production an analytical data management system, Napa, to meet the extremely demanding requirements of scalability, sub-second query response times, availability, and strong consistency at Google.
External Merge Sort for Top-K Queries: Eager input filtering guided by histograms
TLDR
A new top-k algorithm is introduced that can eliminate parts of the input before sorting or writing them to secondary storage, regardless of whether the requested output fits in the available memory.
Scalable Querying of Nested Data
TLDR
This work proposes a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated, and provides an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs.
DIAMetrics: Benchmarking Query Engines at Scale
TLDR
It is argued that DIAMetrics' core concepts can be used more widely to enable comparative end-to-end benchmarking in other industrial environments.
Monarch: Google's Planet-Scale In-Memory Time Series Database
TLDR
The structure of the system and the novel mechanisms that achieve a reliable and flexible unified system on a regionalized distributed architecture are described.
F1 Lightning
TLDR
The design and experiences of F1 Lightning, a system built and deployed to meet the challenge of supporting both new and legacy applications that demand transparent fast queries and transactions from this combination, are reported on.
RAMP-TAO: Layering Atomic Transactions on Facebook's Online TAO Data Store
TLDR
The RAMP-TAO protocol is presented, a variation based on the Read Atomic Multi-Partition (RAMP) protocols that can be feasibly deployed in production with minimal overhead while ensuring atomic visibility for a read-optimized workload at scale.
Advancing Analytical Database Systems
TLDR
A new deep learning approach to cardinality estimation is contributed, which is the core problem in cost-based query optimization, and a new neural network model is proposed that can capture correlations between columns, even across tables.
Remus: Efficient Live Migration for Distributed Databases with Snapshot Isolation
TLDR
Remus is the only effective approach to achieve the goal of zero transaction interruption, zero downtime and marginal performance impact, paving the way for applying the shared-nothing architecture to a cloud database which needs to provide elasticity while guaranteeing strict SLAs.
Sort-based grouping and aggregation In-memory b-trees for run generation and merging
TLDR
This paper introduces a new algorithm for sort-based duplicate removal, grouping, and aggregation that can serve as a system's only aggregation algorithm for unsorted inputs, thus preventing erroneous algorithm choices.
...
...

References

SHOWING 1-10 OF 59 REFERENCES
Enabling JSON Document Stores in Relational Systems
TLDR
Argo is presented, a proof-of-concept mapping layer for storing and querying JSON data in a relational system with an easy-to-use SQL-like query language, and NoBench, a micro-benchmark suite for queries over JSON data in NoSQL and SQL systems.
Shasta: Interactive Reporting At Scale
TLDR
Shasta, a middleware system built at Google to support interactive reporting in complex user-facing applications related to Google's Internet advertising business, has significantly improved system scalability and software engineering efficiency compared to the middleware solutions it replaced.
F1: A Distributed SQL Database That Scales
F1 is a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like
Accelerating Big Data analytics with Collaborative Planning in Teradata Aster 6
TLDR
An innovative concept of “Collaborative Planning” is introduced, which results in the removal of redundant operations and a more optimal rearrangement of query plan operators, and reduces query execution times as much as 90.0% in common use cases, resulting in a 24x speedup.
Tenzing a SQL implementation on the MapReduce framework
TLDR
The architecture and implementation of Tenzing are described, along with benchmarks of typical analytical queries and several key characteristics of the Tenzing system.
User defined aggregates in object-relational systems
  • Haixun Wang, C. Zaniolo
  • Computer Science
    Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073)
  • 2000
TLDR
A unified solution to these problems that realizes the original SQL3 proposal for user-defined aggregates (UDAs) and adds significant improvements in terms of expressive power and ease of use.
SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions
TLDR
This paper presents a new approach to implementing a UDF, called SQL/MapReduce (SQL/MR), that overcomes many of the limitations of present UDFs and facilitates highly scalable computation within the database.
Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
TLDR
This paper describes the architecture of Aurora and the design considerations leading to that architecture, and describes how Aurora achieves consensus on durable state across numerous storage nodes using an efficient asynchronous scheme, avoiding expensive and chatty recovery protocols.
The Snowflake Elastic Data Warehouse
TLDR
The paper highlights some of the key features of Snowflake: extreme elasticity and availability, semi-structured and schema-less data, time travel, and end-to-end security.
The state of the art in distributed query processing
TLDR
The paper presents the “textbook” architecture for distributed query processing and a series of techniques that are particularly useful for distributed database systems, and discusses different kinds of distributed systems such as client-server, middleware (multitier), and heterogeneous database systems and shows how query processing works in these systems.
...
...