Automatic Discovery of Semantic Structures in HTML Documents

Saikat Mukherjee, Guizhen Yang, Wenfang Tan, I. V. Ramakrishnan
Template-driven HTML documents possess an implicit, fixed schema that denotes concepts and their relationships hierarchically. Discovering this schema remains a relatively unexplored problem. Exploiting the key observation that semantically related items in HTML documents exhibit spatial locality, we develop an algorithm that automatically partitions such documents into tree-like semantic structures, which expose the implicit schema.
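The spatial-locality observation can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not spell out; it is a simplified stand-in that assumes "semantically related" can be approximated by adjacent sibling elements sharing the same tag structure. The names `Node`, `TreeBuilder`, `signature`, `parse`, and `group_adjacent` are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser
from itertools import groupby


class Node:
    """Minimal DOM-like node (hypothetical helper, not from the paper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from well-nested HTML."""

    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


def signature(node):
    """Structural signature: the tag plus the signatures of all children."""
    return (node.tag, tuple(signature(c) for c in node.children))


def text_of(node):
    """Concatenated text of a node and its descendants."""
    return node.text + "".join(text_of(c) for c in node.children)


def group_adjacent(node):
    """Partition children into runs of adjacent, structurally identical siblings."""
    return [[text_of(c) for c in run]
            for _, run in groupby(node.children, key=signature)]


sample = ("<div><h2>News</h2><p><a>one</a></p><p><a>two</a></p>"
          "<h2>Sports</h2><p><a>three</a></p></div>")
print(group_adjacent(parse(sample).children[0]))
# prints [['News'], ['one', 'two'], ['Sports'], ['three']]
```

Grouping only adjacent siblings, rather than all siblings with the same signature, is what encodes locality here: the two headings have identical structure but stay in separate partitions because other content lies between them.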


Publications citing this paper.

Heading-based sectional hierarchy identification for HTML documents

22nd International Symposium on Computer and Information Sciences • 2007

Precise web page segmentation based on semantic block headers detection

6th International Conference on Digital Content, Multimedia Technology and its Applications • 2010

Semantic Scholar estimates that this publication has 52 citations based on the available data.

