• Corpus ID: 221136031

Loghub: A Large Collection of System Log Datasets towards Automated Log Analytics

Shilin He, Jieming Zhu, Pinjia He, Michael R. Lyu
Logs have been widely adopted in software system development and maintenance because of the rich system runtime information they contain. In recent years, the increase of software size and complexity leads to the rapid growth of the volume of logs. To handle these large volumes of logs efficiently and effectively, a line of research focuses on intelligent log analytics powered by AI (artificial intelligence) techniques. However, only a small fraction of these techniques have reached successful… 

LogFlow: Simplified Log Analysis for Large Scale Systems
The paper presents LogFlow, a tool that helps human operators analyse logs by automatically constructing graphs of correlations between log entries, using an interpretable predictive model based on a Recurrent Neural Network augmented with a state-of-the-art attention layer.
On the Naturalness and Localness of Software Logs
This paper begins with the hypothesis that log files are natural and local and that these attributes can be used to automate log analysis tasks, and it guides the research with six research questions regarding the naturalness and localness of log files.
A Comprehensive Survey of Logging in Software: From Logging Statements Automation to Log Mining and Analysis
A systematic literature review and mapping of the contemporary logging practices and log statements’ mining and monitoring techniques and their applications such as in system failure detection and diagnosis is provided.
Survey on Online Log Parsers
This paper surveys and compares online log parsers by analysing the type of technique used, the efficiency and accuracy of each parser on a given dataset, its time complexity, and its effectiveness in motivating applications.
PRINS: scalable model inference for component-based system logs
The model inference technique, called PRINS, follows a divide-and-conquer approach, and can process large logs much faster than a publicly available and well-known state-of-the-art tool, without significantly compromising the accuracy of inferred models.
LTmatch: A Method to Abstract Pattern from Unstructured Log
The LTmatch algorithm is proposed; it extracts log patterns based on a weighted word matching rate and improves average accuracy compared with the best result among all the other methods.
Detecting Anomalies in Software Execution Logs with Siamese Network
This paper proposes a novel anomaly detection approach based on the Siamese network, introduces a method of monitoring the evolution of logs at run-time without label requirements, and presents a visualization technique that facilitates human administration of log anomaly detection.
Maintainable Log Datasets for Evaluation of Intrusion Detection Systems
This work presents a collection of maintainable log datasets collected in a testbed representing a small enterprise and uses concepts from model-driven engineering that enable automatic generation and labeling of an arbitrary number of datasets that comprise repetitions of attack executions with variations of parameters.
UniLog: Deploy One Model and Specialize it for All Log Analysis Tasks
This work proposes a log data pretrained transformer to utilize the enormous unlabeled log data, and a corresponding multi-log-tasking finetune strategy for various log analysis tasks.
MithriLog: Near-Storage Accelerator for High-Performance Log Analytics
A log analytics platform with near-storage accelerators for high-performance, cost- and power-efficient unstructured log processing that achieves an order of magnitude higher performance over software systems, even against more expensive machines with enough DRAM to stage the entire dataset.


An Evaluation Study on Log Parsing and Its Use in Log Mining
Four log parsers are studied and packaged into a toolkit to allow their reuse; six insightful findings are obtained by evaluating the performance of the log parsers on five datasets with over ten million raw log messages, and their effectiveness on a real-world log mining task is thoroughly examined.
Logzip: Extracting Hidden Structures via Iterative Clustering for Log Compression
This paper proposes a novel and effective log compression method, namely logzip, capable of extracting hidden structures from raw logs via fast iterative clustering and further generating coherent intermediate representations that allow for more effective compression.
LogMine: Fast Pattern Recognition for Log Analytics
The proposed method, named LogMine, is a robust method that works for heterogeneous log messages generated in a wide variety of systems and generates patterns that are as good as those generated by an exact and unscalable method, while achieving a 500× speedup.
The Unified Logging Infrastructure for Data Analytics at Twitter
This paper presents Twitter's production logging infrastructure and its evolution from application-specific logging to a unified "client events" log format, where messages are captured in common, well-formatted, flexible Thrift messages.
LogSig: generating system events from raw textual logs
This paper proposes a message-signature-based algorithm, logSig, to generate system events from textual log messages; it can handle various types of log data and is able to incorporate human domain knowledge to achieve high performance.
Abstracting log lines to log event types for mining software system logs
  • M. Nagappan, M. Vouk
  • Computer Science
  • 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010)
This paper presents a log file abstraction technique based on the clustering approach used in the Simple Log file Clustering Tool, which is especially useful when the lines in the log file do not conform to a rigid structure.
Filtering failure logs for a BlueGene/L prototype
A three-step filtering algorithm is performed on error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list, to substantially compress these logs, removing over 99.96% of the 828,387 original entries.
Identifying impactful service system problems via log analysis
This paper proposes Log3C, a novel clustering-based approach that promptly and precisely identifies impactful system problems by utilizing both log sequences (a sequence of log events) and system KPIs, and that greatly reduces clustering time while keeping high accuracy.
Drain: An Online Log Parsing Approach with Fixed Depth Tree
This work proposes an online log parsing method, namely Drain, that can parse logs in a streaming and timely manner, and uses a fixed depth parse tree, which encodes specially designed rules for parsing.
Assisting developers of Big Data Analytics Applications when deploying on Hadoop clouds
A lightweight approach for uncovering differences between pseudo and large-scale cloud deployments, which makes use of the readily-available yet rarely used execution logs from these platforms and provides very few false positives when identifying deployment failures.