Assisting developers of Big Data Analytics Applications when deploying on Hadoop clouds

@inproceedings{Shang2013AssistingDO,
  title={Assisting developers of Big Data Analytics Applications when deploying on Hadoop clouds},
  author={Weiyi Shang and Zhen Ming Jack Jiang and Hadi Hemmati and Bram Adams and Ahmed E. Hassan and Patrick Martin},
  booktitle={2013 35th International Conference on Software Engineering (ICSE)},
  year={2013},
  pages={402-411}
}
Big data analytics is the process of examining large amounts of data (big data) in an effort to uncover hidden patterns or unknown correlations. Big Data Analytics Applications (BDA Apps) are a new type of software application that analyzes big data using massively parallel processing frameworks (e.g., Hadoop). Developers of such applications typically develop them using a small sample of data in a pseudo-cloud environment. Afterwards, they deploy the applications in a large-scale cloud…
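The workflow the abstract describes, developing against a small data sample in a pseudo-cloud and only later deploying at scale, can be made concrete with Hadoop Streaming, where the same mapper and reducer scripts run unchanged on a local sample and on a full cluster. The word-count sketch below is a hypothetical illustration in Python, not code from the paper; the script name and invocation are invented for this example.

#!/usr/bin/env python3
# wordcount_streaming.py: a minimal, hypothetical Hadoop Streaming job
# (illustration only, not code from the paper). The same script acts as
# mapper or reducer, so it runs identically on a laptop sample and on a
# cluster.
import sys

def mapper(stream):
    # Hadoop Streaming feeds each input split to stdin and shuffles the
    # "key <tab> value" lines written to stdout.
    for line in stream:
        for word in line.split():
            print(f"{word}\t1")

def reducer(stream):
    # After the shuffle, lines arrive sorted by key, so all counts for
    # one word are contiguous.
    current, total = None, 0
    for line in stream:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # "map" or "reduce" is chosen on the command line.
    {"map": mapper, "reduce": reducer}[sys.argv[1]](sys.stdin)

During development the job can be exercised on a sample with ordinary shell pipes (cat sample.txt | ./wordcount_streaming.py map | sort | ./wordcount_streaming.py reduce), while the cluster run submits the identical script through Hadoop's streaming jar with its -mapper and -reducer options; that small-sample-to-cluster transition is the deployment step this paper targets.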

Citations

BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark
TLDR
BigDebug provides a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next-generation data-intensive scalable cloud computing platform, and shows that BigDebug supports debugging at interactive speeds with minimal performance impact.
CF4BDA: A Conceptual Framework for Big Data Analytics Applications in the Cloud
TLDR
A conceptual framework named CF4BDA is presented to analyze the existing work on BDA applications from two perspectives: 1) the lifecycle of BDA applications and 2) the objects involved in the context of BDA applications in the cloud.
Hug the Elephant: Migrating a Legacy Data Analytics Application to Hadoop Ecosystem
  • Feng Zhu, Jie Liu, Tao Huang
  • 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2016
TLDR
This paper presents the migration of a legacy data analytics application in a provincial data center, using a query-aware approach to free developers from tedious manual work, and shows that the migrated application achieves high scalability and high performance.
Developing a Real-Time Data Analytics Framework Using Hadoop
TLDR
This paper proposes an architecture based on the Storm/YARN projects for ingestion, processing, exploration, and visualization of streaming structured and unstructured data, and implements the proposed architecture using Apache Storm-related APIs in both a local mode and a distributed mode.
Interactive Debugging for Big Data Analytics
TLDR
The BIGDEBUG framework provides interactive debugging primitives and tool-assisted fault localization services for big data analytics, showcasing data provenance and optimized incremental computation features to effectively and efficiently support interactive debugging.
LogM: Log Analysis for Multiple Components of Hadoop Platform
TLDR
This paper develops a framework called LogM that leverages both a deep learning model and knowledge graph technology for failure prediction and analysis of Hadoop clusters, and shows that LogM is highly effective in predicting and diagnosing system failures.
Analysis of Various Big Data Techniques for Security
TLDR
This paper analyzes Big Data Analytics concepts and discusses some existing techniques and tools for data security, such as Hadoop.
Autonomic deployment decision making for big data analytics applications in the cloud
TLDR
A novel language named DepPolicy is proposed to specify runtime deployment information as policies, together with a decision-making algorithm that can make different deployment decisions for different jobs in a way that maximises overall utility while satisfying all given constraints.
Going big: a large-scale study on what big data developers ask
TLDR
A set of big data tags is developed to extract big data posts from Stack Overflow, and topic modeling is used to group these posts into big data topics; the popularity and difficulty of these topics and their correlations are then analyzed (a rough sketch of such a pipeline follows below).
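The tag-then-topic-model pipeline that TLDR describes can be sketched in a few lines. The sketch below is hypothetical and assumes scikit-learn's CountVectorizer and LatentDirichletAllocation; the tag set and posts are invented placeholders, not the study's data or model settings.

# Hypothetical sketch of tag-filtered topic modeling over Q&A posts,
# in the spirit of the study above; tags and posts are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

BIG_DATA_TAGS = {"hadoop", "mapreduce", "hdfs", "spark"}  # assumed tag set

posts = [
    {"tags": {"hadoop", "java"}, "body": "job fails when the input split is too large"},
    {"tags": {"python"}, "body": "how do I sort a list of tuples"},
    {"tags": {"spark"}, "body": "executor runs out of memory during a shuffle"},
]

# Step 1: keep only posts carrying at least one big data tag.
bodies = [p["body"] for p in posts if p["tags"] & BIG_DATA_TAGS]

# Step 2: fit LDA on a bag-of-words representation of those posts.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(bodies)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Step 3: list the top words of each discovered topic.
terms = vec.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-3:][::-1]]
    print(f"topic {i}: {top}")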

References

SHOWING 1-10 OF 26 REFERENCES
Hadoop: The Definitive Guide
TLDR
This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience
TLDR
Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce, and performance comparisons between Pig execution and raw Map-Reduce execution are reported.
Disco: a computing platform for large-scale data analytics
TLDR
Disco is a distributed computing platform for MapReduce-style computations on large-scale data that provides both a fault-tolerant scheduling and execution layer as well as a distributed and replicated storage layer.
Detecting large-scale system problems by mining console logs
TLDR
This work first parses console logs by combining source code analysis with information retrieval to create composite features, and then analyzes these features using machine learning to automatically detect system runtime problems.
An Exploratory Study of the Evolution of Communicated Information about the Execution of Large Software Systems
TLDR
This study explores the concept of CI and its evolution by mining the execution logs of one large open-source and one industrial software system, and illustrates the need for better traceability techniques between CI and the Log Processing Apps that analyze it.
Interactions with big data analytics
TLDR
It is no surprise today that big data is useful for HCI researchers and user interface design, and A/B testing is a standard practice in the usability community to help determine relative differences in user performance using different interfaces.
Automatic identification of load testing problems
TLDR
This paper presents an approach that mines the execution logs of an application to uncover its dominant behavior and flags anomalies that indicate load testing problems, with a relatively small number of false alarms.
Pig latin: a not-so-foreign language for data processing
TLDR
A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce; it is an open-source Apache-incubator project available for general use.
System evolution tracking through execution trace analysis
TLDR
The EvoTrace approach uses standard database technology and the instrumentation facilities of development tools, so exchanging data with other analysis tools is facilitated; its applicability is shown on the Mozilla open-source system, consisting of about 2 million lines of C/C++ code.
Chukwa: A large-scale monitoring system
TLDR
The design and initial implementation of Chukwa, a data collection system for monitoring and analyzing large distributed systems that inherits Hadoop’s scalability and robustness, and includes a flexible and powerful toolkit for displaying monitoring and analysis results.