Automatic Web Page Segmentation and Noise Removal for Structured Extraction using Tag Path Sequences

@article{Velloso2013AutomaticWP,
  title={Automatic Web Page Segmentation and Noise Removal for Structured Extraction using Tag Path Sequences},
  author={Roberto Panerai Velloso and Carina F. Dorneles},
  journal={JIDM},
  year={2013},
  volume={4},
  pages={173--187}
}
Web page segmentation and data cleaning are essential steps in structured web data extraction. Identifying a web page's main content region and removing what is not important (menus, ads, etc.) can greatly improve the performance of the extraction process. For this task, we propose a novel, fully automatic algorithm that uses a tag path sequence (TPS) representation of the web page. The TPS is a sequence of symbols (a string), each one representing a different tag path. The proposed…
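
To make the TPS idea concrete, the following is a minimal sketch of how a page could be turned into such a sequence. It is not the authors' implementation: the abstract does not say how tag paths are collected or encoded, so the choices here are assumptions (paths are recorded at each opening tag only, text nodes are ignored, and each distinct path is mapped to an integer symbol in order of first appearance). The names `TagPathCollector` and `tag_path_sequence` are hypothetical.

```python
from html.parser import HTMLParser

# HTML void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records the root-to-node tag path of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.paths = []   # one path string per opening tag

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_path_sequence(html):
    """Return (sequence, alphabet): one integer symbol per tag path occurrence.

    Assumption: symbols are assigned in order of first appearance, so equal
    tag paths always map to the same symbol.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    alphabet = {}
    sequence = []
    for path in collector.paths:
        symbol = alphabet.setdefault(path, len(alphabet))
        sequence.append(symbol)
    return sequence, alphabet


if __name__ == "__main__":
    page = "<html><body><ul><li>a</li><li>b</li></ul><p>c</p></body></html>"
    seq, alpha = tag_path_sequence(page)
    print(seq)
    print(alpha)
```

In this sketch the two `li` elements share the path `html/body/ul/li` and therefore the same symbol, which is the kind of repetition a sequence-based segmentation method could exploit to tell structured regions from surrounding noise.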
