Real-time Data Infrastructure at Uber

  • Yupeng Fu, Chinmay Soman
  • Published 31 March 2021
  • Computer Science
  • Proceedings of the 2021 International Conference on Management of Data
Uber's business is highly real-time in nature. Petabytes of data are continuously collected every day from end users such as Uber drivers, riders, restaurants, and eaters. There is a lot of valuable information to process, and many decisions must be made in seconds for a variety of use cases such as customer incentives, fraud detection, and machine learning model prediction. In addition, there is an increasing need to expose this ability to different user categories, including engineers…


An Automated Cost Prediction in Uber/Call Taxi Using Machine Learning Algorithm

The aim of this paper is to compare the fares of specified cabs, predict the lowest-fare cab using linear regression, and build an application that helps users select the cab with the best benefits and lowest fare.

Recent Advances in Wearable Sensing Technologies

This paper discusses the use of consumer wearables during the coronavirus disease 2019 (COVID-19) pandemic caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and the open challenges that must be addressed to further improve the efficacy of wearable sensing systems in the future.

Meces: Latency-efficient Rescaling via Prioritized State Migration for Stateful Distributed Stream Processing Systems

This paper proposes Meces, an on-the-fly state migration mechanism which prioritizes the state migration of hot keys (those being processed or about to be processed by downstream operator tasks) to achieve smooth rescaling.



Realtime Data Processing at Facebook

This paper identifies five important design decisions that affect ease of use, performance, fault tolerance, scalability, and correctness in the realtime stream processing systems Puma, Swift, and Stylus, and illustrates how these decisions and systems satisfy the requirements of multiple use cases at Facebook.

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

One such approach is presented, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.

Twitter Heron: Stream Processing at Scale

Heron is now the de facto stream data processing engine inside Twitter. This paper presents the design and implementation of the new system and shares the experiences from running Heron in production.

Helios: Hyperscale Indexing for the Cloud & Edge

The simple data model behind Helios is presented, which offers great flexibility and control over costs, and enables the system to asynchronously index massive streams of data.

Big Data: Principles and best practices of scalable realtime data systems

Big Data describes a scalable, easy to understand approach to big data systems that can be built and run by a small team that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.

Presto: SQL on Everything

This paper outlines a selection of use cases that Presto supports at Facebook, describes its architecture and implementation, and calls out the features and performance optimizations that enable it to support these use cases.

Apache Flink: Stream Analytics at Scale

This half-day tutorial will introduce Apache Flink, and give a tutorial on its streaming capabilities using concrete examples of application scenarios, focusing on concepts such as stream windowing, and stateful operators.

F1: the fault-tolerant distributed RDBMS supporting Google's ad business

F1 is a novel hybrid system that combines the scalability, fault tolerance, transparent sharding, and cost benefits so far available only in "NoSQL" systems with the usability, familiarity, and transactional guarantees expected from an RDBMS.

Procella: Unifying serving and analytical data at YouTube

Procella implements a superset of capabilities required to address all of the four use cases above, with high scale and performance, in a single product.

Photon: fault-tolerant and scalable joining of continuous data streams

The architecture of Photon is described, a geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency, where the streams may be unordered or delayed.