Element Relationship: Exploiting Inline Markup for Better XML Retrieval

Abstract

With the increasing popularity of semi-structured documents (particularly in the form of XML) for knowledge management, it is important to create tools that use the additional information contained in the markup. Although research on textual XML retrieval is still in its early stages, many retrieval approaches and engines exist. The use of inline markup in these engines so far is very limited. We introduce the concept of element relationship and describe how it can improve similarity calculation. We illustrate our ideas with examples based on an existing document collection.

1 Textual XML Retrieval

In traditional Information Retrieval (IR), a user has an information need and wants to obtain documents fulfilling that need from a document base [BYRN99]. The situation is essentially the same in retrieval on document-centric XML; one important difference is that documents are not assumed to be atomic units, that is, the retrieval engine should return the most specific fragments satisfying the query. The traditional IR techniques can be used for semi-structured data as well, but as they do not make use of the additional information contained in the markup, they are likely not to yield the best results possible.

Because of this, new, XML-specific query languages and retrieval engines were developed. Structure-based query languages like XPath and XQuery assume that the searcher has an intimate knowledge of the documents to be queried, as they expect him to formulate queries based on the element names and nesting. A draft version of XQuery Full-Text adds a contains operator that supports comparison using standard IR techniques; the user still has to specify exact paths, however. Other query languages are closer to the ones used in traditional IR [FG01, TW02]. All these approaches use the XML markup to some extent.
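
The contrast between path-based querying and returning the most specific fragment can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree module; the sample document, the function names full_text and most_specific, and the naive substring match are assumptions made for illustration and are not the retrieval method of the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical DocBook-like sample document (not from the paper's collection).
doc = ET.fromstring(
    "<article><section><title>Retrieval</title>"
    "<para>XML retrieval returns <emphasis>fragments</emphasis>.</para>"
    "</section></article>"
)

# A structure-based query only succeeds if the searcher knows the exact nesting.
hits = doc.findall("./section/para")
misses = doc.findall("./chapter/para")  # wrong guess about the structure


def full_text(elem):
    """Concatenate all text inside elem, including text of descendants."""
    return "".join(elem.itertext())


def most_specific(root, term):
    """Return tags of elements that contain term while no child contains it.

    Naive substring matching stands in for a real similarity measure.
    """
    result = []
    for elem in root.iter():
        if term in full_text(elem) and not any(
            term in full_text(child) for child in elem
        ):
            result.append(elem.tag)
    return result


print([e.tag for e in hits])
print(misses)
print(most_specific(doc, "fragments"))
print(most_specific(doc, "retrieval returns"))
```

The content-oriented function needs no knowledge of element names, whereas the path query silently returns nothing when the assumed structure is wrong.
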
Markup can be used at several levels in an XML document schema: Block-level markup can be used to embed metadata (like authors' names) and to represent the structure of the document; examples include <body> in (X)HTML and <section> in DocBook [WM99]. Inline markup is used on single words or (short) phrases to convey the meaning or intended representation of the marked-up contents.
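
The distinction between block-level and inline markup can be approximated without a schema. The sketch below, again using Python's xml.etree.ElementTree, treats an element as inline when its parent has mixed content (child elements interleaved with non-whitespace text). This heuristic, the sample fragment, and the names has_mixed_content and classify are assumptions for illustration only; a real engine would more likely consult the document schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical DocBook-like fragment with block-level and inline elements.
SAMPLE = (
    "<section>"
    "<title>Markup levels</title>"
    "<para>Use <emphasis>inline</emphasis> tags for "
    "<filename>short</filename> phrases.</para>"
    "</section>"
)


def has_mixed_content(elem):
    """True if elem has child elements and non-whitespace text of its own."""
    if len(elem) == 0:
        return False
    pieces = [elem.text] + [child.tail for child in elem]
    return any(piece and piece.strip() for piece in pieces)


def classify(root):
    """Map each tag to 'inline' or 'block' (assumed heuristic, keyed by tag).

    Keying by tag is a simplification: if a tag occurs in several contexts,
    the last occurrence wins.
    """
    levels = {root.tag: "block"}
    for parent in root.iter():
        mixed = has_mixed_content(parent)
        for child in parent:
            levels[child.tag] = "inline" if mixed else "block"
    return levels


levels = classify(ET.fromstring(SAMPLE))
print(levels)
```

Here the emphasis and filename elements are classified as inline because they sit inside running text, while section, title, and para are classified as block-level.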

Cite this paper

@inproceedings{Dopichaj2005ElementRE, title={Element Relationship: Exploiting Inline Markup for Better XML Retrieval}, author={Philipp Dopichaj}, booktitle={BTW}, year={2005} }