Scaling big data mining infrastructure: the twitter experience

Authors: Jimmy J. Lin and Dmitriy V. Ryaboy
Journal: SIGKDD Explor.
The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this paper, we discuss the evolution of our infrastructure and the development of capabilities for data mining on "big data". One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life "in the trenches" is occupied by much preparatory work that… 

A survey of big data in social media using data mining techniques

  • Sheela Gole, B. Tidke
  • Computer Science
    2015 International Conference on Advanced Computing and Communication Systems
  • 2015
This paper deals with the 5Vs, features, challenges, and future of Big Data in the social media arena, using data mining algorithms, tools, and the Hadoop framework to overcome the challenges of Big Data.

Fast data in the era of big data: Twitter's real-time related query suggestion architecture

A case study illustrating the challenges of real-time data processing in the era of "big data", and the story of how the system was built twice, which points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.

Curating Big Data Made Simple: Perspectives from Scientific Communities

The architecture and design of a cloud platform that meets some of these requirements are presented, and a big data curation model that describes how a community of earth and environmental scientists is using the platform to curate data is described.

A Spectrum of Big Data Applications for Data Analytics

This chapter provides a broad view of big data in the medical application domain, and designs and implements a framework that handles big data using several preprocessing and data mining techniques to discover hidden knowledge in large-scale databases.

Big Data Analytic Framework for Organizational Leverage

This paper explores the BDA process and capabilities in leveraging data via three case studies of prime users of BDA tools, and emphasizes four key components of the BDA process framework: system coordination, data sourcing, big data application service, and end users.

A general perspective of Big Data: applications, tools, challenges and trends

This paper aims to provide a comprehensive review of the Big Data literature of the last 4 years, to identify the main challenges, areas of application, tools, and emergent trends of Big Data.

Migrating GIS Big Data Computing from Hadoop to Spark: An Exemplary Study Using Twitter

In this paper, an emerging system named Spark is investigated, a timely pilot experience on geospatial big data research is presented, and optimization strategies for using Spark on different geospatial computing tasks are discussed.

Review: Big Data Techniques of Google, Amazon, Facebook and Twitter

This study is a useful reference for researchers seeking to identify the differences among the big data approaches and technologies of Google, Facebook, Twitter, and Amazon, and it outlines their variations and similarities.

Cluster-discovery of Twitter messages for event detection and trending

Big data and ICT applications: A study

The paper tries to establish the wide range of applications of big data in ICT with the currently available data mining & data analytics platforms, languages and tools.



The Unified Logging Infrastructure for Data Analytics at Twitter

This paper presents Twitter's production logging infrastructure and its evolution from application-specific logging to a unified "client events" log format, where messages are captured in common, well-formatted, flexible Thrift messages.

Distilling Massive Amounts of Data into Simple Visualizations : Twitter Case Studies

The purpose of this paper is to highlight the “pulse” of the global conversation on Twitter, often in reaction to major news events around the world.

Large-scale machine learning at twitter

A case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification.

Data warehousing and analytics infrastructure at facebook

This paper presents how Scribe, Hadoop and Hive together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook and enabled us to implement a data warehouse that stores more than 15PB of data and loads more than 60TB of new data every day.

MAD Skills: New Analysis Practices for Big Data

This paper highlights the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence, and describes database design methodologies that support the agile working style of analysts in these settings.

Full-text indexing for optimizing selection operations in large-scale data analytics

It is shown that it is possible to leverage a full-text index to optimize selection operations on text fields within records in Hadoop, and moderate improvements in end-to-end query running times and substantial savings in terms of cumulative processing time at the worker nodes are shown.

Building LinkedIn's Real-time Activity Data Pipeline

The design and engineering problems the authors encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka are discussed.

RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems

This paper presents a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system and shows the effectiveness of RCFile in satisfying the four requirements.

The MADlib Analytics Library or MAD Skills, the SQL

The MADlib project is introduced, including the background that led to its beginnings and the motivation for its open-source nature; an overview of the library's architecture and design patterns is provided, along with a description of various statistical methods in that context.

Enterprise Data Analysis and Visualization: An Interview Study

This work characterizes the process of industrial data analysis, documents how organizational features of an enterprise impact it, and describes recurring pain points, outstanding challenges, and barriers to adoption for visual analytic tools.