The RDF-based Information Capturing System from Web Pages

Abstract

The purpose of this study is to acquire event information from municipality websites and to convert the extracted information into the XML form of the RDF model. Web-wrapper methods extract information from Web pages by using HTML tags, but their extraction performance depends on the structure of those tags. In this paper, we propose a dictionary-based method for extracting information from an HTML document. The HTML tags are removed from the document to convert it into plain text, and target character strings are extracted by comparing the text with a collection of words prepared beforehand. Finally, the extracted information is converted into the XML form of the RDF model. The method was evaluated on municipality websites in Japan, covering 23 districts of Tokyo and 56 municipalities of Chiba Prefecture. Overall, the proposed method extracted 73% of the event information, compared with 52% for the LR-Wrapper, 55% for the Tree-Wrapper, and 32% for the PLR-Wrapper. These results confirm that combining a simple algorithm with a collection of words extracts event information at a higher rate than the existing methods.
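The three steps named in the abstract (tag removal, dictionary comparison, RDF/XML output) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the substring matching rule, the `ex:event` property, and the `http://example.org/event#` namespace are all assumptions made for illustration, and the paper's actual word collection and matching algorithm are not given in the abstract.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/event#"  # hypothetical vocabulary, not from the paper


class TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves.

    Simplification: text inside <script> or <style> is not filtered out.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_lines(html):
    """Step 1: delete the HTML tags and keep one string per text node."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def extract_events(lines, dictionary):
    """Step 2: keep lines containing at least one dictionary word.

    Plain substring matching is an assumed stand-in for the paper's
    comparison against its prepared collection of words.
    """
    return [line for line in lines if any(word in line for word in dictionary)]


def to_rdf_xml(events, source_url):
    """Step 3: serialize the extracted strings as RDF/XML."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("ex", EX_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": source_url}
    )
    for event in events:
        ET.SubElement(description, f"{{{EX_NS}}}event").text = event
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = (
        "<html><body><h1>City News</h1><ul>"
        "<li>Summer Festival on July 20</li><li>Contact us</li>"
        "</ul></body></html>"
    )
    lines = html_to_lines(page)
    events = extract_events(lines, ["Festival", "Concert"])
    print(to_rdf_xml(events, "http://example.org/city"))
```

Because the matching step works on plain text, a change in the page's tag layout does not affect which strings are found, which is the property the abstract contrasts with wrapper-based extraction.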

DOI: 10.1109/3PGCIC.2010.34

Cite this paper

@article{Ushioda2010TheRI, title={The RDF-based Information Capturing System from Web Pages}, author={Tatsuya Ushioda and Shigeru Fujita}, journal={2010 International Conference on P2P, Parallel, Grid, Cloud and Internet Computing}, year={2010}, pages={201-206} }