Extracting General Lists from Web Documents: A Hybrid Approach

Fabio Fumarola, Tim Weninger, Rick Barber, Donato Malerba, Jiawei Han

The problem of extracting structured data (e.g., lists, record sets, and tables) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that considering the HTML structure and the visual cues of a Web page independently does not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions on the…
This paper has 19 citations.
