Scaling big data mining infrastructure: the Twitter experience

@article{Lin2013ScalingBD,
  title={Scaling big data mining infrastructure: the {T}witter experience},
  author={Jimmy J. Lin and D. Ryaboy},
  journal={SIGKDD Explor.},
  year={2013},
  volume={14},
  pages={6--19}
}
The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this paper, we discuss the evolution of our infrastructure and the development of capabilities for data mining on "big data". One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life "in the trenches" is occupied by much preparatory work that…
A survey of big data in social media using data mining techniques
  • Sheela Gole, B. Tidke
  • Computer Science
  • 2015 International Conference on Advanced Computing and Communication Systems
  • 2015
TLDR
This paper deals with the 5Vs, features, challenges, and future of Big Data in the social media arena, using data mining algorithms, tools, and the Hadoop framework to overcome the challenges of Big Data.
Fast data in the era of big data: Twitter's real-time related query suggestion architecture
TLDR
A case study illustrating the challenges of real-time data processing in the era of "big data", and the story of how the system was built twice, which points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.
Curating Big Data Made Simple: Perspectives from Scientific Communities
TLDR
The architecture and design of a cloud platform that meets some of these requirements are presented, and a big data curation model that describes how a community of earth and environmental scientists is using the platform to curate data is described.
A Spectrum of Big Data Applications for Data Analytics
TLDR
This chapter provides a broad view of big data in the medical application domain, and designs and implements a framework that can handle big data by using several preprocessing and data mining techniques to discover hidden knowledge from large-scale databases.
Big Data Analytic Framework for Organizational Leverage
Web data have grown exponentially to reach zettabyte scales. Mountains of data come from several online applications, such as e-commerce, social media, web and sensor-based devices, business web…
Migrating GIS Big Data Computing from Hadoop to Spark: An Exemplary Study Using Twitter
TLDR
In this paper, an emerging system named Spark is investigated, a timely pilot experience on geospatial big data research is presented, and optimization strategies on using Spark for different geospatial computing tasks are discussed.
A general perspective of Big Data: applications, tools, challenges and trends
TLDR
This paper aims to provide a comprehensive review of the Big Data literature of the last 4 years, to identify the main challenges, areas of application, tools and emergent trends of Big Data.
Review: Big Data Techniques of Google, Amazon, Facebook and Twitter
TLDR
This study is a useful reference for researchers seeking to identify the differences among the big data approaches and technologies of Google, Facebook, Twitter and Amazon, and it outlines their variations and similarities.
Data Mining and Data Pre-processing for Big Data
TLDR
A pre-processing algorithm to extract real-time user-accessed data from the Windows operating system environment and an approach using the MapReduce functionality of Apache's Hadoop Distributed File System (HDFS) framework to mine and analyze this large dataset are presented.
Big data and ICT applications: A study
TLDR
The paper tries to establish the wide range of applications of big data in ICT with the currently available data mining & data analytics platforms, languages and tools.

References

SHOWING 1-10 OF 59 REFERENCES
The Unified Logging Infrastructure for Data Analytics at Twitter
TLDR
This paper presents Twitter's production logging infrastructure and its evolution from application-specific logging to a unified "client events" log format, where messages are captured in common, well-formatted, flexible Thrift messages.
Distilling Massive Amounts of Data into Simple Visualizations: Twitter Case Studies
Twitter is a communications platform on which users can send short, 140-character messages, called “tweets”, to their “followers” via a number of mechanisms, including web clients, mobile clients,…
Large-scale machine learning at Twitter
TLDR
A case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification.
Data warehousing and analytics infrastructure at Facebook
TLDR
This paper presents how Scribe, Hadoop and Hive together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook and enable a data warehouse that stores more than 15PB of data and loads more than 60TB of new data every day.
MAD Skills: New Analysis Practices for Big Data
TLDR
This paper highlights the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence, and describes database design methodologies that support the agile working style of analysts in these settings.
Full-text indexing for optimizing selection operations in large-scale data analytics
TLDR
It is shown that it is possible to leverage a full-text index to optimize selection operations on text fields within records in Hadoop, and moderate improvements in end-to-end query running times and substantial savings in terms of cumulative processing time at the worker nodes are shown.
Building LinkedIn's Real-time Activity Data Pipeline
TLDR
The design and engineering problems the authors encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka are discussed.
RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems
TLDR
This paper presents a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system and shows the effectiveness of RCFile in satisfying the four requirements.
The MADlib Analytics Library or MAD Skills, the SQL
TLDR
The MADlib project is introduced, including the background that led to its beginnings and the motivation for its open-source nature, and an overview of the library's architecture and design patterns and a description of various statistical methods in that context are provided.
Enterprise Data Analysis and Visualization: An Interview Study
TLDR
This work characterizes the process of industrial data analysis and documents how organizational features of an enterprise impact it, and describes recurring pain points, outstanding challenges, and barriers to adoption for visual analytic tools.