Title extraction from bodies of HTML documents and its application to web page retrieval

@inproceedings{Hu2005TitleEF,
  title={Title extraction from bodies of HTML documents and its application to web page retrieval},
  author={Yunhua Hu and Guomao Xin and Ruihua Song and Shuming Shi and Yunbo Cao and Hang Li},
  booktitle={SIGIR},
  year={2005}
}
This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML documents. This is an issue which does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to address the problem. We propose a specification…
