UNSUPERVISED APPROACH FOR SEMI-STRUCTURED DATA RECORD EXTRACTION FROM MULTIPLE PAGES USING TAG TREE SIMILARITIES

@inproceedings{Ansari2015UNSUPERVISEDAF,
  title={UNSUPERVISED APPROACH FOR SEMI-STRUCTURED DATA RECORD EXTRACTION FROM MULTIPLE PAGES USING TAG TREE SIMILARITIES},
  author={Aleem Ansari and H. B. Vasistha},
  year={2015}
}
In this paper we present a novel unsupervised approach for data record extraction from multiple similar web pages using tag tree similarities. Extracting data records from multiple web pages consists of the following steps. We first identify the related web pages from the web source. Next we construct the DOM tree for the related web pages using an HTML parser. We then compare two or more web pages to eliminate unwanted regions such as the header, menu bar, navigation bar, advertisements, etc., and find…
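
The tree construction and page comparison steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a simple stand-in measure, the Jaccard overlap of root-to-node tag paths, and the names `TagPathCollector`, `tag_paths`, and `tag_tree_similarity` are hypothetical. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are recorded but not pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/div"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def tag_tree_similarity(html_a, html_b):
    """Jaccard overlap of tag paths: an assumed measure, not the paper's."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Under this assumed measure, pages generated from the same template score close to 1.0. Tag paths that occur in every related page are candidates for shared regions such as headers or navigation bars, though a path-set comparison ignores repetition and ordering that a full tree comparison would capture.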
