Open Source Forge (OSF) websites provide information on massive open source software projects, and extracting this web data is important for open source research. Traditional extraction methods use string matching among pages to detect page templates, which is time-consuming. A recent work published in VLDB exploits redundant entities among websites to detect...
Mining repeated patterns from HTML documents is the key step towards Web-based data mining and knowledge extraction. Many web crawling applications need efficient repeated-pattern mining techniques to generate their wrappers automatically. Existing approaches such as tree matching and string matching can detect repeated patterns with high precision, but...
Mining log patterns to analyze faults in large-scale distributed systems is hampered by redundant and ambiguous noisy error logs. While existing works try to compress logs at a coarse granularity from temporal and spatial views to remove the redundancy, they fail to preserve those ambiguous logs that might truly relate to a fault, which...
