Detecting data and schema changes in scientific documents

Nabil R. Adam, Igg Adiwijaya, Terence Critchlow, Ron Musick
Proceedings IEEE Advances in Digital Libraries 2000
Data stored in a data warehouse must be kept consistent and up-to-date with respect to the underlying information sources. Given the capability to identify, categorize and detect changes in these sources, only the modified data needs to be transferred and entered into the warehouse. The alternative, periodically reloading from scratch, is obviously inefficient. When the schema of an information source changes, all components that interact with, or make use of data originating from…
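
The distinction the abstract draws between data changes and schema changes can be illustrated with a minimal sketch. This is not the paper's method: it assumes each source snapshot is a dictionary of records keyed by a record identifier, and the function names, the record fields and the sample data are all hypothetical. Record contents are compared by hash so that only inserted, deleted or modified records would need to be transferred to the warehouse.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of a record's content (field order does not matter)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def detect_data_changes(old, new):
    """Categorize record-level changes between two snapshots keyed by record id."""
    old_keys, new_keys = set(old), set(new)
    return {
        "inserted": sorted(new_keys - old_keys),
        "deleted": sorted(old_keys - new_keys),
        "modified": sorted(
            key for key in old_keys & new_keys
            if fingerprint(old[key]) != fingerprint(new[key])
        ),
    }


def detect_schema_changes(old, new):
    """Report fields that occur in only one of the two snapshots."""
    old_fields = {field for record in old.values() for field in record}
    new_fields = {field for record in new.values() for field in record}
    return {
        "added_fields": sorted(new_fields - old_fields),
        "removed_fields": sorted(old_fields - new_fields),
    }


# Hypothetical snapshots of one source, taken at two points in time.
old_snapshot = {
    "P1": {"name": "protein A", "length": 120},
    "P2": {"name": "protein B", "length": 340},
}
new_snapshot = {
    "P1": {"name": "protein A", "length": 125, "organism": "E. coli"},
    "P3": {"name": "protein C", "length": 210, "organism": "yeast"},
}

data_changes = detect_data_changes(old_snapshot, new_snapshot)
schema_changes = detect_schema_changes(old_snapshot, new_snapshot)
print(data_changes)
print(schema_changes)
```

Comparing flat field sets is the simplest possible notion of a schema change; sources with nested or semistructured data would need a tree-based comparison instead.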

Synchronizing XPath views
  • D. Pedersen, T. Pedersen
  • Computer Science
    Proceedings. International Database Engineering and Applications Symposium, 2004. IDEAS '04.
  • 2004
This work presents techniques for discovering schema changes in XML data sources and synchronizing XPath-based views to reflect these schema changes, and is the first presented technique for synchronizing views over XML data.
Achieving adaptivity for OLAP-XML federations
The potential problems that may interrupt the operation of the integration system, in particular those caused by the often autonomous and unreliable nature of external XML data sources, are described, along with methods for handling them.
BSML: A Binding Schema Markup Language for Data Interchange in Problem Solving Environments (PSEs)
BSML is designed to integrate with a PSE or application composition system that views model specification and execution as a problem of managing semistructured data.
SI in digital libraries
Research challenges involved with SI that are specific to digital libraries are identified, the available solutions that meet the digital library requirements and the digital library prototypes are discussed, and the environmental digital library system under development at the Cen-Issues in SI is described.
A management Technique for Protein Version Information based on Local Sequence Alignment and Trigger
A technique for managing protein version sequences based on local sequence alignment and a technique for managing protein historical reference data using Trigger are proposed; the approach automatically determines the similarity between an original sequence and each version sequence while the protein version sequences are stored in the database.
Intelligent Tickers : An Information Integration Scheme for Active Information Gathering
Intelligent Ticker is a system that consists of multiple information gathering modules and an information integration module, and that produces Tickers based on the difference between an updated Web page and the original one.
Data and Computation Modeling for Scientific Problem Solving Environments
This thesis investigates several issues in data and computation modeling for scientific problem solving environments (PSEs) and emphasizes data modeling and management, two important aspects that have been largely neglected in modern PSE research.


DataFoundry: information management for scientific data
The paper discusses issues within the context of the DataFoundry project, an ongoing research effort at Lawrence Livermore National Laboratory that utilizes a unique integration strategy to identify corresponding instances while maintaining differences between data from different sources, and a novel architecture and an extensive meta-data infrastructure, which reduce the cost of maintaining a warehouse.
Change detection in hierarchically structured information
This work defines the hierarchical change detection problem as the problem of finding a "minimum-cost edit script" that transforms one data tree to another, and presents efficient algorithms for computing such an edit script.
Representing and querying changes in semistructured data
A model for representing changes in semistructured data and a language for querying over these changes are presented; an important feature of this approach is that it represents and queries changes directly as annotations on the affected data, instead of indirectly as the difference between database states.
Extracting schema from semistructured data
It is established that the general problem of finding an optimal form of semistructured data based on labeled, directed graphs is NP-hard, but some heuristics and techniques based on clustering that allow efficient and near-optimal treatment of the problem are presented.
Representative objects: concise representations of semistructured, hierarchical data
Introduces the concept of representative objects, which uncover the inherent schema(s) in semi-structured, hierarchical data sources and provide a concise description of the structure of the data.
Semistructured data
A number of issues surrounding semistructured data are covered: finding a concise formulation, building a sufficiently expressive language for querying and transformation, and optimization problems.
The TSIMMIS Project: Integration of Heterogeneous Information Sources
An overview of the TSIMMIS Project is given, describing components that extract properties from unstructured objects, that translate information into a common object model, that combine information from several sources, that allow browsing of information, and that manage constraints across heterogeneous sites.
Querying Semi-Structured Data
The main purpose of the paper is to isolate the essential aspects of semistructured data, and survey some proposals of models and query languages for semi-structured data.
Tracking and Viewing Changes on the Web
A set of tools that detect when World-Wide-Web pages have been modified and present the modifications visually to the user through marked-up HTML are described, with an emphasis on systems issues such as scalability, security, and error conditions.