Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis
Information on the World Wide Web is accessed not just visually but also by automated systems, such as search engines and alternative browsers (e.g. screen readers and voice browsers), which extract relevant data from Web pages and present it. In most cases extraction cannot be performed directly, since today's HTML documents lack adequate semantic markup. This thesis proposes a method that converts an HTML document into a semantically enhanced document representation, from which generic document components can be extracted for further knowledge exploration or alternative presentation. The document is parsed, and its smaller nodes are iteratively mapped to a classification ontology and then aggregated into larger segments, thereby creating a semantically enhanced parse tree. Segment boundaries are detected based on visual and document segments, such as images and headings. Experimental results of the implementation show that document components, such as headings and menus, can be extracted directly from the semantic parse tree. The heading extraction experiment achieved recall and precision rates of 88% and 91%, respectively. The recall and precision rates for the menu extraction experiment were both 90%.
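
The pipeline described above (parse, classify small nodes, aggregate into larger segments, extract components) can be illustrated with a minimal sketch. It is not the implementation evaluated in the thesis: the `TAG_CLASSES` table is a hypothetical stand-in for the classification ontology, the labels (`Heading`, `Link`, `Menu`, `Segment`) and the aggregation rule are assumptions made for illustration, and the visual cues used for segment boundary detection are not modelled at all.

```python
from html.parser import HTMLParser

# Hypothetical tag-to-class table standing in for the classification ontology.
TAG_CLASSES = {"h1": "Heading", "h2": "Heading", "h3": "Heading",
               "a": "Link", "img": "Image", "p": "Paragraph"}
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text, self.label = [], "", None


class TreeBuilder(HTMLParser):
    """Builds a simple parse tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never receive children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def annotate(node):
    """Label nodes bottom-up: classify small nodes, then aggregate parents."""
    for child in node.children:
        annotate(child)
    label = TAG_CLASSES.get(node.tag)
    child_labels = [c.label for c in node.children if c.label]
    if label is None and child_labels:
        if set(child_labels) == {"Link"}:
            # Assumed rule: several link-only children form a menu;
            # a single link is passed up unchanged.
            label = "Menu" if len(child_labels) > 1 else "Link"
        else:
            label = "Segment"
    node.label = label


def extract(node, label):
    """Collect all nodes in the annotated tree that carry the given label."""
    found = [node] if node.label == label else []
    for child in node.children:
        found.extend(extract(child, label))
    return found


def text_of(node):
    """Concatenate the text of a node and all of its descendants."""
    parts = [node.text] + [text_of(c) for c in node.children]
    return " ".join(filter(None, parts))


def annotate_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    annotate(builder.root)
    return builder.root
```

Once the tree is annotated, extracting a component type reduces to a lookup by label, for example `[text_of(n) for n in extract(annotate_html(page), "Heading")]`, where `page` is any HTML string. A realistic classifier would need far richer features than tag names, which is the gap the ontology-based mapping in the thesis is meant to fill.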