Title extraction from bodies of HTML documents and its application to web page retrieval

@inproceedings{Hu2005TitleEF,
  title={Title extraction from bodies of HTML documents and its application to web page retrieval},
  author={Yunhua Hu and Guomao Xin and Ruihua Song and Guoping Hu and Shuming Shi and Yunbo Cao and Hang Li},
  booktitle={SIGIR '05},
  year={2005}
}
  • Yunhua Hu, Guomao Xin, Ruihua Song, Guoping Hu, Shuming Shi, Yunbo Cao, Hang Li
  • Published in SIGIR '05, 2005
  • Computer Science
  • This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML documents. This is an issue which does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to address the problem. We propose a specification…

    Citations

    Publications citing this paper (48 in total).

    Using linguistic features to automatically extract web page title

    CITES BACKGROUND
    HIGHLY INFLUENCED

    Inferring Structure and Meaning of Semi-Structured Documents by using a Gibbs Sampling Based Approach

    The Determination of Cluster Number at k-Mean Using Elbow Method and Purity Evaluation on Headline News

    CITES BACKGROUND

    Content-based Title Extraction from Web Page

    CITES METHODS

    References

    Publications referenced by this paper.