An approach for the extensional integration of data sources with heterogeneous representation formats
Semistructured data occur in situations where information lacks a homogeneous structure and is incomplete. Yet, up to now the incompleteness of information has not been reflected by special features of query languages for semistructured data. Our goal is to investigate the principles of queries that allow for incomplete answers. We do not present, however, a concrete query language.

Queries over classical structured data models contain a number of variables and conditions on these variables. An answer is a binding of the variables by elements of the database such that the conditions are satisfied. In the present paper, we loosen this concept in so far as we allow also answers that are partial, that is, not all variables in the query are bound by such an answer.

Partial answers make it necessary to refine the model of query evaluation. The first modification relates to the satisfaction of conditions: under some circumstances we consider conditions involving unbound variables as satisfied. Second, in order to prevent a proliferation of answers, we only accept answers that are maximal in the sense that there are no assignments that bind more variables and satisfy the conditions of the query.

Our model of query evaluation consists of two phases, a search phase and a filter phase. Semistructured databases are essentially labeled directed graphs. In the search phase, we use a query graph containing variables to match a maximal portion of the database graph. We investigate three different semantics for query graphs, which give rise to three variants of matching. For each variant, we provide algorithms and complexity results.

In the filter phase, the maximal matchings resulting from the search phase are subjected to constraints, which may be weak or strong. Strong constraints require all their variables to be bound, while weak constraints do not. We describe a polynomial algorithm for evaluating a special type of queries with filter constraints and assess the complexity of evaluating other queries for several kinds of constraints.

In the final part, we investigate the containment problem for queries consisting only of search constraints under the different semantics.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PODS '99 Philadelphia PA Copyright ACM 1999 1-58113-062-7/99/05...$5.00

The growing need to integrate data from heterogeneous sources and to access data sources with irregular or incomplete contents is the main motivation for research into semistructured data models and query languages for them. Semistructured data do not comply with a strict schema and are inherently incomplete. Query languages for such data should reflect these characteristics.

Semistructured data models have been intensively studied recently [Abi97, Bun97]. They originated with work on heterogeneous data integration [QRSS94, PGMW95, RU96]. Several models for representing semistructured data have been proposed together with query languages for those models such as Lorel [AQM+97, MAGS97, QWG+96] and UnQL [BDHS96]. Further topics of research have been the design of schemas for semistructured data [BDFS97] and the extraction of schemas from the data [GW97, NAM98].

A particular motivation for this research has been to allow one to access heterogeneous sources on the World Wide Web in an integrated fashion by providing a view of the web as a semistructured database [AV97a, KMSS98]. For the purpose of querying the World Wide Web several query languages and Web site management tools have been proposed, such as W3QL [KS95, KS97], WebSQL [MMM97], Strudel [FFK+98], Araneus [MAM+98], and others [LSS96, AM98]. The growing use of the Web emphasizes the need of querying semistructured data and retrieving partial answers when complete answers are not found.

We define a simple data model that is similar to OEM [PGMW95, AQM+97], where databases are labeled directed graphs. A node represents an object, a label an attribute, and a labeled edge links a node to another one if the second node is an attribute filler for the first node. Our queries, too, are defined by graphs, which are to be matched by the database. The idea to base a query language on graphs appeared already in [CMW82]. In this paper we apply it to semistructured data.

In an abstract view, database queries consist of a set of variables and constraints on the variables. A solution to a query is a binding of the variables to objects in the database such that the constraints are satisfied. In order to be able to accept as solutions also assignments that do not bind all variables, we refine the structure of a query. We divide the constraints into search constraints and filter constraints. The search constraints form a labeled directed graph whose nodes are variables: they are a pattern that has to be matched by some part of the database. We are only interested in maximal matchings, because they contain max