Trend of Supervised Web Data Extraction

  • Galih Hendro Martono, Azhari Azhari, K. Mustafa
  • International Journal of Computer Applications
The Web has evolved since it was first developed in 1990 and has grown rapidly since then. Based on available information, it currently contains at least 4.54 billion pages. With so many pages, the Web stores a large amount of information that can be put to use. Making use of that information gives rise to the concept of data extraction. Web data extraction aims to retrieve the contents of websites so that they can easily be used for other purposes. The…

A brief survey of web data extraction tools
A taxonomy for characterizing Web data extraction tools is proposed, the major Web data extraction tools described in the literature are briefly surveyed, and a qualitative analysis of them is provided.
A Survey of Web Information Extraction Systems
This paper surveys the major Web data extraction approaches and compares them in three dimensions: the task domain, the degree of automation, and the techniques used. The authors believe these criteria provide qualitative measures for evaluating various IE approaches.
Structured Data Extraction from the Web Based on Partial Tree Alignment
  • Yanhong Zhai, B. Liu
  • Computer Science
    IEEE Transactions on Knowledge and Data Engineering
  • 2006
A novel and effective technique to perform the task of Web data extraction automatically, called DEPTA, which consists of two steps: identifying individual records in a page and aligning and extracting data items from the identified records.
Web Data Extraction, Applications and Techniques: A Survey
Web data extraction based on partial tree alignment
Experimental results using a large number of Web pages from diverse domains show that the proposed two-step technique is able to segment data records, align and extract data from them very accurately.
Three classes of criteria are presented that are capable of determining why an IE system fails to handle some Web sites of particular structures and of measuring the degree of automation for IE systems.
Web classification using extraction and machine learning techniques
This paper discusses the results of classifying web documents using extraction and machine learning techniques and shows that the linear kernel performs best in web document classification compared to the RBF, polynomial, and sigmoid kernels.
DOM tree based approach for Web content extraction
  • B. Mehta, M. Narvekar
  • Computer Science
    2015 International Conference on Communication, Information & Computing Technology (ICCICT)
  • 2015
The system will be able to extract 75%-80% of user-relevant content by eliminating noisy content from differently structured web pages, such as blogs, forums, and articles, in a dynamic environment.
DEPTA: An efficient technique for web data extraction and alignment
The proposed system is based on identifying data records, extracting data values, and arranging these values in a database; it uses the partial tree alignment method to obtain a better alignment outcome.
A hybrid method for Web data extraction
  • Yu Wang, Lizhu Zhou
  • Computer Science
    Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003)
  • 2003
This work presents a hybrid algorithm, bi-direction data extraction (BiDDE for short), which takes full advantage of the strengths of both top-down and bottom-up algorithms while avoiding their weaknesses.