Information Integration Using Logical Views

Abstract

A number of ideas concerning information-integration tools can be thought of as constructing answers to queries using views that represent the capabilities of information sources. We review the formal basis of these techniques, which are closely related to containment algorithms for conjunctive queries and/or Datalog programs. Then we compare the approaches taken by AT&T Labs' "Information Manifold" and the Stanford "Tsimmis" project in these terms.

1 Theoretical Background

Before addressing information-integration issues, let us review some of the basic ideas concerning conjunctive queries, Datalog programs, and their containment. To begin, we use the logical rule notation from [Ull88].

Example 1. The following:

    p(X,Z) :- a(X,Y) & a(Y,Z).

is a rule that talks about a, an EDB predicate ("Extensional DataBase," or stored relation), and p, an IDB predicate ("Intensional DataBase," or predicate whose relation is constructed by rules). In this and several other examples, it is useful to think of a as an "arc" predicate defining a graph, while other predicates define certain structures that might exist in the graph. That is, a(X,Y) means there is an arc from node X to node Y. In this case, the rule says "p(X,Z) is true if there is an arc from node X to node Y and also an arc from Y to Z." That is, p represents paths of length 2.

In general, there is one atom, the head, on the left of the "if" sign, :-, and zero or more atoms, called subgoals, on the right side (the body). The head always has an IDB predicate; the subgoals can have IDB or EDB predicates. Thus, here p(X,Z) is the head, while a(X,Y) and a(Y,Z) are subgoals.

We assume that each variable appearing in the head also appears somewhere in the body. This "safety" requirement assures that when we use a rule, we are not left with undefined variables in the head when we try to infer a fact about the head's predicate. We also assume that atoms consist of a predicate and zero or more arguments.
An argument can be either a variable or a constant. However, we exclude function symbols from arguments.

* This work was supported by NSF grant IRI-96-31952, ARO grant DAAH04-95-1-0192, and Air Force contract F33615-93-1-1339.

1.1 Conjunctive Queries

A conjunctive query (CQ) is a rule with subgoals that are assumed to have EDB predicates. A CQ is applied to the EDB relations by considering all possible substitutions of values for the variables in the body. If a substitution makes all the subgoals true, then the same substitution, applied to the head, is an inferred fact about the head's predicate.

Example 2. Consider Example 1, whose rule is a CQ. If a(X,Y) is true exactly when there is an arc X → Y in a graph G, then a substitution for X, Y, and Z will make both subgoals true when there are arcs X → Y → Z. Thus, p(X,Z) will be inferred exactly when there is a path of length 2 from X to Z in G.

A crucial question about CQ's is whether one is contained in another. If Q1 and Q2 are CQ's, we say Q1 ⊆ Q2 if for all databases (truth assignments to the EDB predicates) D, the result of applying Q1 to D [written Q1(D)] is a subset of Q2(D). Two CQ's are equivalent if and only if each is contained in the other. It turns out that in almost all cases, the only approach known for testing equivalence is by testing containment in both directions. Moreover, in information-integration applications, containment appears to be more fundamental than equivalence, so from here we shall concentrate on the containment test.

Conjunctive queries and their containment were first studied by Chandra and Merlin ([CM77]). Here, we shall give another test, following the approach of [R*89], because this test extends more naturally to the generalizations of the CQ-containment problem that we shall discuss. To test whether Q1 ⊆ Q2:

1. Freeze the body of Q1 by turning each of its subgoals into facts in the database.
That is, replace each variable in the body by a distinct constant, and treat the resulting subgoals as the only tuples in the database.
2. Apply Q2 to this canonical database.
3. If the frozen head of Q1 is derived by Q2, then Q1 ⊆ Q2. Otherwise, not; in fact the canonical database is a counterexample to the containment, since surely Q1 derives its own frozen head from this database.

Example 3. Consider the following two CQ's:

    Q1: p(X,Z) :- a(X,Y) & a(Y,Z).
    Q2: p(X,Z) :- a(X,U) & a(V,Z).

Informally, Q1 looks for paths of length 2, while Q2 looks only for nodes X and Z such that X has an arc out to somewhere, and Z has an arc in from somewhere. Intuitively, we expect Q1 ⊆ Q2, and that is indeed the case.

In this and other examples, we shall use integers starting at 0 as the constants that "freeze" the CQ, although obviously the choice of constants is irrelevant. Thus, the canonical database D constructed from Q1 consists of the two tuples a(0,1) and a(1,2) and nothing else. The frozen head of Q1 is p(0,2). If we apply Q2 to D, the substitution X → 0, U → 1, V → 1, and Z → 2 yields p(0,2) in the head of Q2. Since this fact is the frozen head of Q1, we have verified Q1 ⊆ Q2.

Incidentally, for this containment test and the more general tests of following subsections, the argument that it works is, in brief:

- If the test is negative, then the constructed database is a counterexample to the containment.
- If the test is positive, then there is an implied homomorphism τ from the variables of Q2 to the variables of Q1. We obtain τ by seeing what constant each variable X of Q2 was mapped to in the successful application of Q2 to the canonical database. τ(X) is the variable of Q1 that corresponds to this constant. If we now apply Q1 to any database D and yield a particular fact for the head, let the homomorphism from the variables of Q1 to the database symbols that we use in this application be σ.
Then the composition of τ followed by σ, that is, first mapping the variables of Q2 to variables of Q1 and then mapping those to database symbols, is a homomorphism from the variables of Q2 to the database symbols that shows how Q2 will yield the same head fact. This argument proves Q1 ⊆ Q2.

Containment of CQ's is NP-complete ([CM77]), although [Sar91] shows that in the common case where no predicate appears more than twice in the body, there is a linear-time algorithm for containment.

1.2 CQ's With Negation

An important extension of CQ's is to allow negated subgoals in the body. The effect of applying a CQ to a database is as before, but now when we make a substitution of constants for variables, the atoms in the negated subgoals must be false, rather than true (i.e., the negated subgoal itself must be true). Now, the containment test is slightly more complex; the problem is complete for the class Π₂ᵖ, problems that can be expressed as {w | (∀x)(∃y) φ(w, x, y)}, where strings x and y are of length bounded by a polynomial function of the length of w, and φ is a function that can be computed in polynomial time.

This test, due to Levy and Sagiv ([LS93]), involves exploring an exponential number of "canonical" databases, any one of which can provide a counterexample to the containment. Suppose we wish to test Q1 ⊆ Q2. We do the following:

1. Consider each substitution of constants for variables in the body of Q1, allowing the same constant to be substituted for two or more variables. More precisely, consider all partitions of the variables of Q1 and assign to each block of the partition a unique constant. Thus, we obtain a number of canonical databases D1, D2, ..., Dk, where k is the number of partitions of a set of n elements, and n is the number of variables in the body of Q1. Each Di consists of the frozen positive subgoals of Q1 only, not the negated subgoals.
2. For each Di, consider whether Di makes all the subgoals of Q1 true. Note that because the atom in a negated subgoal may happen to be in Di, it is possible that Di makes the body of Q1 false.
3.
For those Di that make the body of Q1 true, test whether every Q2(D'i) includes the frozen head of Q1, where D'i ranges over the databases that are supersets of Di formed by adding other tuples that use the same set of symbols as Di. However, D'i may not include any tuple that is a frozen negative subgoal of Q1. When determining what the frozen head of Q1 is, we make the same substitution of constants for variables that yielded Di.
4. If every Di either makes the body of Q1 false or yields the frozen head of Q1 when Q2 is applied, then Q1 ⊆ Q2. Otherwise, not.

Example 4. Let us consider the following two conjunctive queries:

    Q1: p(X,Z) :- a(X,Y) & a(Y,Z) & NOT a(X,Z).
    Q2: p(A,C) :- a(A,B) & a(B,C) & NOT a(A,D).

Intuitively, Q1 looks for paths of length 2 that are not "short-circuited" by a single arc from beginning to end. Q2 looks for paths of length 2 that start from a node A that is not a "universal source"; i.e., there is at least one node D not reachable from A by an arc.

To show Q1 ⊆ Q2 we need to consider all partitions of {X, Y, Z}. There are five of them: one that keeps all three variables separate, one that groups them all, and three that group one pair of variables. The table in Fig. 1 shows the five cases and their outcomes.

    Partition         Canonical database     Outcome
    1) {X}{Y}{Z}      {a(0,1), a(1,2)}       both yield head p(0,2)
    2) {X,Y}{Z}       {a(0,0), a(0,1)}       Q1 body false
    3) {X}{Y,Z}       {a(0,1), a(1,1)}       Q1 body false
    4) {X,Z}{Y}       {a(0,1), a(1,0)}       both yield head p(0,0)
    5) {X,Y,Z}        {a(0,0)}               Q1 body false

    Fig. 1. The five canonical databases and their outcomes

For instance, in case (1), where all three variables are distinct, and we have arbitrarily chosen the constants 0, 1, and 2 for X, Y, and Z, respectively, the canonical database D1 is the two positive subgoals, frozen to be a(0,1) and a(1,2). The frozen negative subgoal NOT a(0,2) is true in this case, since a(0,2) is not in D1.
Thus, Q1 yields its own head, p(0,2), and we must test that Q2 does likewise on any database consisting of symbols 0, 1, and 2 that includes the two tuples of D1 and does not include the tuple a(0,2), the frozen negative subgoal of Q1. If we use the substitution A → 0, B → 1, C → 2, and D → 2, then the positive subgoals become true for any such superset of D1. The negative subgoal becomes NOT a(0,2), and we have explicitly excluded a(0,2) from any of these databases. We conclude that the Levy-Sagiv test holds for case (1).

Now consider case (2), where X and Y are equated and Z is different. We have chosen to use 0 for X and Y, and 1 for Z. Then the canonical database for this case is D2, consisting of the frozen positive subgoals a(0,0) and a(0,1). For this substitution, the negative subgoal of Q1 becomes NOT a(0,1). Since a(0,1) is in D2, this subgoal is false. Thus, for this substitution of constants for variables in Q1, we do not even derive the head of Q1. We need check no further in this case; the test is satisfied.

The three remaining cases must be checked as well. However, as indicated in Fig. 1, in each case either both CQ's yield the frozen head of Q1 or Q1 does not yield its own frozen head. Thus, the test is completely satisfied, and we conclude Q1 ⊆ Q2.

1.3 CQ's With Arithmetic Comparisons

Another important extension of CQ-containment theory is the inclusion of arithmetic comparisons as subgoals. In this regard we must consider the set of values in the database as belonging to a totally ordered set, e.g., the integers or reals. When we consider possible assignments of integer constants to the variables of conjunctive query Q1, we may use consecutive integers, starting at 0, but now we must consider not only partitions of variables into sets of equal value, but among the blocks of the partition, we must consider the relative order of their values.
The canonical database is constructed from those subgoals that have nonnegated, uninterpreted predicates only, not those with a negation or a comparison operator. If there are negated subgoals, then we must also consider certain supersets of the canonical databases, as we did in Section 1.2. But if there are no negated subgoals, then the canonical databases alone suffice.

Example 5. Now consider the following two conjunctive queries, each of which refers to a graph in which nodes are assumed to be integers.

    Q1: p(X,Z) :- a(X,Y) & a(Y,Z) & X<Y.
    Q2: p(A,C) :- a(A,B) & a(B,C) & A<C.

Both ask for paths of length 2. But Q1 requires that the first node be numerically less than the second, while Q2 requires that the first node be numerically less than the third.

The number of different canonical databases is 13. We must consider the five different partitions of {X, Y, Z}, as we did in Fig. 1. However, we also have to order the blocks of each partition. For partition (1) of Fig. 1, where each variable is separate, we have 6 possible orders of the blocks. For partitions (2) through (4), where there are only two blocks, we have 2 different orders. Finally, for partition (5), with only one block, there is one order.

In this example, the containment test fails. We have only to find one of the 13 cases to show failure. For instance, consider X = Z = 0 and Y = 1. The canonical database D for this case is {a(0,1), a(1,0)}, and since X < Y, the body of Q1 is true. Thus, Q2(D) must include the frozen head of Q1, p(0,0). However, no assignment of values to A, B, and C makes all three subgoals of Q2 true when D is the database. That is, in order to make subgoals a(A,B) and a(B,C) both true for D, we surely must use 0 or 1 for all of A, B, and C. Then to make A < C true, we must have A = 0 and C = 1. But then, whether B is 0 or 1 we shall have in Q2 a subgoal a(0,0) or a(1,1), neither of which is in D. Thus, D is a counterexample to Q1 ⊆ Q2.
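The test of Example 5 can be sketched in a few lines of Python. This is a brute-force illustration, not an algorithm taken from the cited papers: the tuple encoding of a CQ and the function names (`ordered_partitions`, `answers`, `find_counterexample`) are assumptions made for this sketch, only the comparison `<` is supported, negated subgoals are not handled, and every variable is assumed to occur in some ordinary subgoal.

```python
from itertools import product

# Hypothetical encoding: a CQ is (head_args, atoms, comparisons); atoms are
# (predicate, args) pairs and each comparison (l, r) stands for l < r.
Q1 = (("X", "Z"), [("a", ("X", "Y")), ("a", ("Y", "Z"))], [("X", "Y")])
Q2 = (("A", "C"), [("a", ("A", "B")), ("a", ("B", "C"))], [("A", "C")])

def variables(cq):
    head, atoms, comps = cq
    names = list(head) + [v for _, args in atoms for v in args]
    names += [v for pair in comps for v in pair]
    return sorted(set(names))

def ordered_partitions(names):
    """Yield each way to give the names ranks 0..k-1 with every rank used."""
    for ranks in product(range(len(names)), repeat=len(names)):
        if set(ranks) == set(range(max(ranks) + 1)):
            yield dict(zip(names, ranks))

def answers(cq, db):
    """Apply cq to db, a set of (predicate, tuple) facts, by brute force."""
    head, atoms, comps = cq
    names = variables(cq)
    domain = sorted({c for _, tup in db for c in tup})
    out = set()
    for values in product(domain, repeat=len(names)):
        s = dict(zip(names, values))
        atoms_ok = all((p, tuple(s[v] for v in args)) in db for p, args in atoms)
        if atoms_ok and all(s[l] < s[r] for l, r in comps):
            out.add(tuple(s[v] for v in head))
    return out

def find_counterexample(q1, q2):
    """Return (canonical database, frozen head) refuting q1 in q2, or None."""
    head, atoms, comps = q1
    for s in ordered_partitions(variables(q1)):
        if not all(s[l] < s[r] for l, r in comps):
            continue  # this ordering makes the body of q1 false
        db = {(p, tuple(s[v] for v in args)) for p, args in atoms}
        frozen_head = tuple(s[v] for v in head)
        if frozen_head not in answers(q2, db):
            return db, frozen_head
    return None

print(len(list(ordered_partitions(["X", "Y", "Z"]))))
print(find_counterexample(Q1, Q2))
```

Under these assumptions, the first counterexample the search reaches is the one discussed in Example 5: X = Z = 0 and Y = 1, with canonical database {a(0,1), a(1,0)} and frozen head p(0,0). The enumeration is exponential in the number of variables, which is acceptable only because queries are expected to be short.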
The containment test for CQ's with arithmetic is from [Klug88], and [vdM92] shows that the problem of testing containment for CQ's with arithmetic comparisons is complete for Π₂ᵖ, at least in the case of a dense domain such as the reals. [LS93] actually includes arithmetic comparisons in their work on negation, and we should note that the above technique works even if there are negated subgoals as well as arithmetic comparisons.

There is a more general approach that works for any interpreted predicates, not just a predicate like < or ≤ that forms a total order; it appears in [ZO93]. However, this technique does not include CQ's with negated subgoals.

1.4 Datalog Programs

Let us now return to the original model of rules, excluding negated subgoals and arithmetic comparisons. However, we shall now consider collections of rules, which we call a Datalog program. Such collections of rules have a natural, least-fixed-point interpretation, where we start by assuming the IDB predicates have empty relations. We then use the rules to infer new IDB facts, until no more facts can be inferred. More on the semantics of Datalog, including efficient algorithms for evaluating the IDB predicates, can be found in [Ull88], [Ull89]. While we shall not discuss Datalog with negated subgoals here, because the meaning is debatable in some cases, the principal ideas are surveyed in [Ull94]. Here is an example of a Datalog program and its semantics.

Example 6. Consider the three rules:

    1) p(X,Z) :- q(X,Y) & b(Y,Z).
    2) q(X,Y) :- a(X,Y).
    3) q(X,Z) :- a(X,Y) & p(Y,Z).

Intuitively, think of a graph with two kinds of arcs: "a-arcs" and "b-arcs." Then p and q represent certain kinds of paths. Rule (1) says that a q-path followed by a b-arc is a p-path. Rule (2) says that a single a-arc is a q-path, while rule (3) says that a-arcs followed by p-paths are also q-paths.
It may not be obvious what is going on, but one can prove by an easy induction that the p-paths consist of some number n ≥ 1 of a-arcs followed by an equal number of b-arcs. A q-path is the same, except it has one fewer b-arc.

To get a feel for why this claim holds, consider a particular graph G described by the a and b EDB predicates. Then rule (2) says all the paths a are in the relation for q. We can therefore use rule (1) to infer that any path of the form ab is in the relation for p; more precisely, if there is a path from node X to node Z that follows an a-arc and then a b-arc, p(X,Z) is true. Next, rule (3) tells us that any path of the form aab is a q-path; rule (1) says paths of the form aabb are p-paths, and so on.

Containment questions involving Datalog programs are often harder than for CQ's. [Shm87] shows that containment of Datalog programs is undecidable, while [CV92] shows that containment of a Datalog program in a CQ is doubly exponential. However, the important case for purposes of information integration is the containment of a CQ in a Datalog program, and this question turns out to be no more complex than containment of CQ's ([R*89]).

To test whether CQ Q is contained in Datalog program P, we "freeze" the body of Q, just as we did in Section 1.1, to make a canonical database D. We then see if P(D) contains the frozen head of Q. The only significant difference between containment in a CQ and containment in a Datalog program is that in the latter case we must keep applying the rules until either the head is derived, or no more IDB facts can be inferred.

Example 7. Consider the Datalog program from Example 6, which we shall call P, and the CQ

    Q: p(A,C) :- a(A,B) & b(B,C).

Freezing the body of Q, we obtain the canonical database D = {a(0,1), b(1,2)}. Now, we apply P to D. Rule (2) lets us infer q(0,1) from a(0,1). Then, rule (1) lets us infer p(0,2) from q(0,1) and b(1,2).
Since p(0,2) is the frozen head of Q, our test has concluded positively; Q ⊆ P.

2 Synthesizing Queries From Views

Query containment algorithms connect to information integration via a concept called "synthesizing queries from views." The idea, originally studied by [YL87] and [C*95], is suggested in Fig. 2. There are a number of "EDB" predicates, for which we use p's in Fig. 2. These predicates, which are not truly EDB predicates since they usually don't exist as physically stored relations, can be thought of as representing the basic concepts used in queries. There are also views, denoted by v's in Fig. 2, that represent resources that the integrator uses internally to help answer queries. Each view has a definition in terms of the EDB predicates, and we suppose here that these definitions are conjunctive queries.

2.1 Solving Queries by Views

A query Q is expressed in terms of the EDB predicates, the p's. Our problem is to find a "solution" S for the query Q. A solution is an expression (also a CQ in the figure) in terms of the views. In order to be a valid solution, when we replace the views in S by their definitions, we get an expansion query E, which must be equivalent to the original query Q.

An alternative formulation of the query-synthesis problem is to ask for all solutions S whose expansion E is contained in Q (perhaps properly contained). "The solution" for Q is then the union of all these partial solutions.

    Query Q:      answer(...) :- p_i1(...) & ... & p_in(...)
    Solution S:   answer(...) :- v_j1(...) & ... & v_jr(...)
    Expansion E:  answer(...) :- p_j11 ... p_j1k1   ...   p_jr1 ... p_jrkr

    Fig. 2. Constructing a query from views

Example 8. We shall consider an example that illustrates some technical points, but suffers in realism for the sake of these points. Let us suppose that there is a single EDB predicate p(X,Y) which we interpret to mean that Y is a parent of X. Let there be two views, defined as follows:

    v1(Y,Z) :- p(X,Y) & p(Y,Z).
    v2(X,Z) :- p(X,Y) & p(Y,Z).
Note that the views have the same body but different heads. The first view, v1, actually produces a subset of the relation for p: those child-parent pairs (Y,Z) such that the child is also a parent of some individual X. The second view, v2, produces a straightforward grandparent relation from the parent relation.

Suppose that we want to query this information system for the great grandparents of a particular individual, whom we denote by the constant 0. This query is expressed in terms of the EDB predicate p by

    q(C) :- p(0,A) & p(A,B) & p(B,C).

Our problem is to find a CQ whose subgoals use only the predicates v1 and v2 and whose expansion is equivalent to the query above. A bit of thought tells us that

    s1(C) :- v2(0,D) & v1(D,C).

is a solution. That is, if we replace each of the subgoals of s1 by the definition of the views (being careful to use unique variables in place of those variables that appear in the bodies of the view definitions but not in the heads of those definitions), we get the expansion:

    e1(C) :- p(0,E) & p(E,D) & p(F,D) & p(D,C).

We can use the CQ containment test in both directions to prove that e1 ≡ q. Intuitively, the subgoal p(F,D) in e1 is superfluous, since every time there is a binding for E and D that makes p(E,D) true, we can bind F to the same value as E and make p(F,D) true.

There are other solutions that, when expanded, are contained within q, but are not equivalent to it. Some examples are:

    s2(C) :- v1(0,D) & v2(D,C).
    s3(C) :- v1(0,D) & v1(D,E) & v1(E,C).
    s4(C) :- v2(0,D) & v1(D,C) & v2(C,E).

Solution s2 is equivalent to q if individual 0 has a child in the database. Otherwise, 0 cannot appear as a first component in the relation for v1, and the result of s2 is empty. Thus, s2 ⊆ q, but not conversely. Solution s3 is actually equivalent to s2, while s4 gives those great grandparents of individual 0 who are themselves grandchildren.
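The expansion step and the two-way containment check of Example 8 can be sketched as follows. This is an illustrative sketch only: the tuple encoding of rules, the helper names `expand` and `contained_in`, the fresh-variable names F1, F2, ... and the frozen constants of the form k_X are all assumptions introduced here. Strings beginning with an uppercase letter are treated as variables, everything else as a constant, and view heads are assumed to list distinct variables.

```python
from itertools import count, product

def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()

# Each rule is (head_args, body); body atoms are (predicate, args) pairs.
VIEWS = {
    "v1": (("Y", "Z"), [("p", ("X", "Y")), ("p", ("Y", "Z"))]),
    "v2": (("X", "Z"), [("p", ("X", "Y")), ("p", ("Y", "Z"))]),
}
Q = (("C",), [("p", (0, "A")), ("p", ("A", "B")), ("p", ("B", "C"))])
S1 = (("C",), [("v2", (0, "D")), ("v1", ("D", "C"))])

def expand(solution, views):
    """Replace each view subgoal by its definition, renaming body-only variables."""
    head, body = solution
    fresh = count(1)
    expansion = []
    for view_name, actual in body:
        formal, view_body = views[view_name]
        rename = dict(zip(formal, actual))
        for pred, args in view_body:
            for a in args:
                if is_var(a) and a not in rename:
                    rename[a] = f"F{next(fresh)}"  # hypothetical fresh name
            expansion.append((pred, tuple(rename.get(a, a) for a in args)))
    return head, expansion

def contained_in(q1, q2):
    """Freezing test of Section 1.1: True when q1 is contained in q2."""
    head1, body1 = q1
    head2, body2 = q2

    def freeze(t):
        return f"k_{t}" if is_var(t) else t

    db = {(p, tuple(freeze(a) for a in args)) for p, args in body1}
    frozen_head = tuple(freeze(a) for a in head1)
    names = sorted({a for _, args in body2 for a in args if is_var(a)})
    domain = sorted({c for _, tup in db for c in tup}, key=str)
    for values in product(domain, repeat=len(names)):
        s = dict(zip(names, values))
        body_ok = all((p, tuple(s.get(a, a) for a in args)) in db
                      for p, args in body2)
        if body_ok and tuple(s.get(a, a) for a in head2) == frozen_head:
            return True
    return False

E1 = expand(S1, VIEWS)
print(E1)
print(contained_in(E1, Q), contained_in(Q, E1))
```

With these definitions, `E1` has the same shape as e1 above (F1 and F2 play the roles of E and F), and the test succeeds in both directions. Running the same check on s2 gives containment of its expansion in q but not the converse, in line with the discussion of s2.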
2.2 Minimal-Solution Theorems

It might appear from Example 8 that one can only guess potential solutions for a query and test them via CQ-containment tests. However, there are theorems that limit the search and show that the problem of expressing a query in terms of views, while NP-complete, is no worse than that. As discussed in Section 1.1, we expect that queries will be short, so NP-complete problems are unlikely to be a major bottleneck in practice.

The principal idea is that any view used in a solution must serve some function in the query; a view without a function may be deleted from the solution. For example, every subgoal of the query must be covered by some view. The question of when a view covers a query subgoal is a bit subtle, because two or more views may cover the same subgoal. For instance, consider Example 8, where both p(E,D) and p(F,D) from expansion e1 "cover" p(A,B) from the query. More precisely, A, E, and F may each represent a parent of individual 0, while B and D represent a parent of that parent. Note that p(E,D) and p(F,D) come from the expansion of v2(0,D) and v1(D,C), respectively, in solution s1, so these two subgoals from different views each play the same role in the expansion.

Let us define a solution S for a query Q to be minimal if

1. S ⊆ Q.
2. There is no solution T for Q such that
   (a) S ⊆ T ⊆ Q, and
   (b) T has fewer subgoals than S.

Theorem 1. ([L*95]) If queries are CQ's without negation, arithmetic comparisons, or constants in the body, then every minimal conjunctive-query solution for a query Q has no more subgoals (uses of views) than Q has subgoals.

Theorem 2. ([RSU95]) If queries are CQ's without negation or arithmetic comparisons (but with constants in the body permitted, as in Example 8), then every minimal CQ-solution for a query Q has no more subgoals than the sum of the number of subgoals and number of variables in Q.

Both Theorem 1 ([L*95]) and Theorem 2 ([RSU95]) offer nondeterministic polynomial-time algorithms to find either

1.
A single solution equivalent to the query Q, or
2. A set of solutions whose union is contained in Q and that contains any other solution that is contained in Q.

In each case, one searches "only" an exponential number (as a function of the length of Q) of minimal queries. If we are looking for one solution equivalent to Q, then we may stop if we find one, and we conclude there is none if we have searched all solutions of length up to the bound and found none. If we want all solutions contained in the query, then we search all up to the bound, taking those that are contained in Q.

3 Information-Integration Systems

Information integration has long been recognized as a central problem of modern database systems. While early databases were self-contained, it is now generally realized that there is great value in taking information from various sources and making them work together as a whole. Yet there are several difficult problems to be faced:

- "Legacy" databases cannot be altered just because we wish to support a new, integrating application above them.
- Databases that ostensibly deal with the same concepts may have different shades of meaning for the same term, or use different terms for the same concept.
- Information sources, such as those on the "web," may have no fixed schema or a time-varying schema.

A common integration architecture is shown in Fig. 3. Several sources are wrapped by software that translates between the source's local language, model, and concepts and the global concepts shared by some or all of the sources. System components, here called mediators ([Wie92]), obtain information from one or more components below them, which may be wrapped sources or other mediators. Mediators also provide information to components above them and to external users of the system.

In a sense, a mediator is a view of the data found in one or more sources.
Data does not exist at the mediator, but one may query the mediator as if it were stored data; it is the job of the mediator to go to its sources and find the answer to the query.

Today, the components labeled "mediator" in Fig. 3 are unlikely to be true mediators, but rather data warehouses. If a mediator is like a view, then a warehouse is like a materialized view. That is, the warehouse holds data that is constructed from the data at the sources. The warehouse is queried directly, with

DOI: 10.1007/3-540-62222-5_34


@article{Ullman1997InformationIU, title={Information Integration Using Logical Views}, author={Jeffrey D. Ullman}, journal={Theor. Comput. Sci.}, year={1997}, volume={239}, pages={189-210} }