Distributed Log Analysis on the Cloud Using MapReduce

@article{Aydin2018DistributedLA,
  title={Distributed Log Analysis on the Cloud Using MapReduce},
  author={Galip Aydin and Ibrahim Riza Hallac},
  journal={ArXiv},
  year={2018},
  volume={abs/1802.03589}
}
In this paper we describe our work on designing a web-based, distributed data analysis system built on the popular MapReduce framework, deployed on a small cloud and developed specifically for analyzing web server logs. The log analysis system consists of several cluster nodes; it splits the large log files across a distributed file system and quickly processes them using the MapReduce programming model. The cluster is created using an open-source cloud infrastructure, which allows us to easily expand the…
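As a rough illustration of the kind of job such a system runs, the sketch below shows a minimal Hadoop MapReduce program that counts requests per client IP across log splits stored on a distributed file system. This is an assumption for illustration only: the class name LogIpCount, the per-IP counting task, and the reliance on the first field of the combined access-log format are hypothetical and are not taken from the paper.

// Illustrative sketch only: a minimal Hadoop MapReduce job that counts requests
// per client IP in web server access logs. Names and paths are hypothetical;
// the paper's actual implementation is not shown on this page.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogIpCount {

    // Mapper: emit (clientIp, 1) for every log line. In the common Apache/Nginx
    // combined log format the client IP is the first whitespace-delimited field.
    public static class IpMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text ip = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) {
                return;
            }
            String[] fields = line.split("\\s+");
            ip.set(fields[0]);
            context.write(ip, ONE);
        }
    }

    // Reducer: sum the request counts for each client IP.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private final LongWritable total = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: directory holding the split log files on the distributed file system,
        // args[1]: output directory for the per-IP counts.
        Job job = Job.getInstance(new Configuration(), "log ip count");
        job.setJarByClass(LogIpCount.class);
        job.setMapperClass(IpMapper.class);
        job.setCombinerClass(SumReducer.class); // the same summing logic works as a combiner
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged as a JAR and submitted with something like hadoop jar logipcount.jar LogIpCount /logs/input /logs/output, where the input directory holds the split log files on HDFS; the specific paths here are placeholders.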
1 Citation
Nowhere to Hide Methodology: Application of Clustering Fault Diagnosis in the Nuclear Power Industry
The nowhere to hide (NTH) methodology is proposed as an efficient method to diagnose faults and locate root causes; the results show that system administrators can efficiently determine the root cause with the proposed methodology.

References

Showing 1-10 of 12 references
An Internet traffic analysis method with MapReduce
Presents an Internet flow analysis method based on the MapReduce framework of the cloud computing platform for large-scale networks; it improves flow statistics computation time by 72% compared with the popular flow data processing tool flow-tools on a single host.
Toward scalable internet traffic measurement and analysis with Hadoop
Presents a Hadoop-based traffic monitoring system that performs IP, TCP, HTTP, and NetFlow analysis of multiple terabytes of Internet traffic in a scalable manner, and discusses the performance issues of traffic analysis MapReduce jobs.
HaLoop: Efficient Iterative Data Processing on Large Clusters
Presents HaLoop, a modified version of the Hadoop MapReduce framework designed to serve iterative applications; it dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms.
MapReduce: Simplified Data Processing on Large Clusters
Presents MapReduce, a programming model and an associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
Hadoop: The Definitive Guide
This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Spark: Cluster Computing with Working Sets
Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Applying Hadoop for log analysis toward distributed IDS
Proposes applying the K-Means algorithm to cluster high-volume log data, which is useful for classifying minority instances as possible intruders, together with an IP address summarization method to capture the characteristics of each cluster.
YSmart: Yet Another SQL-to-MapReduce Translator
YSmart, a correlation-aware SQL-to-MapReduce translator, applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query, and can significantly reduce redundant computations, I/O operations, and network transfers compared to existing translators.
Big data and cloud computing: current state and future opportunities
This tutorial presents an organized picture of the challenges faced by application developers and DBMS designers in developing and deploying Internet-scale applications; it crystallizes the design choices made by some successful large-scale database management systems, analyzes application demands and access patterns, and enumerates the desiderata for a cloud-bound DBMS.
Fast and Interactive Analytics over Hadoop Data with Spark
Matei Zaharia is a fifth-year PhD student at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in computer systems, networks, cloud computing, and big data. He is also a committer on…