A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration

@article{Zhao2012ABA,
  title={A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration},
  author={Bo Zhao and Benjamin I. P. Rubinstein and Jim Gemmell and Jiawei Han},
  journal={Proc. VLDB Endow.},
  year={2012},
  volume={5},
  pages={550--561}
}
In practical data integration systems, it is common for the data sources being integrated to provide conflicting information about the same entity. Consequently, a major challenge for data integration is to derive the most complete and accurate integrated records from diverse and sometimes conflicting sources. We term this challenge the truth finding problem. We observe that some sources are generally more reliable than others, and therefore a good model of source quality is the key to solving… 
A Probabilistic Model for Estimating Real-valued Truth from Conflicting Sources
TLDR
This work proposes a new truth-finding method specially designed for handling numerical data based on Bayesian probabilistic models that can leverage the characteristics of numerical data in a principled way, when modeling the dependencies among source quality, truth, and claimed values.
Domain-aware multi-truth discovery from conflicting sources
TLDR
This work proposes an integrated Bayesian approach to incorporate the domain expertise of data sources and confidence scores of value sets, aiming to find multiple possible truths without any supervision, and demonstrates the feasibility, efficiency and effectiveness of this approach.
A Confidence-Aware Approach for Truth Discovery on Long-Tail Data
TLDR
A confidence-aware truth discovery (CATD) method is proposed to automatically detect truths from conflicting data with a long-tail phenomenon; it outperforms existing state-of-the-art truth discovery approaches by successfully discounting the effect of small sources.
Truth Discovery via Exploiting Implications from Multi-Source Data
TLDR
This paper exploits three types of implications, namely the implicit negative claims, the distribution of positive/negative claims, and the co-occurrence of values in sources' claims, to facilitate multi-truth discovery.
MOSTER: A Novel Truth Discovery Method for Multiple Conflicting Information
TLDR
A Multi-prOperty-cluSTERing-based method, abbreviated MOSTER, is proposed to search for the most reliable source and identify the truth.
Influence-Aware Truth Discovery
TLDR
An unsupervised probabilistic model named IATD is proposed, which takes source correlations as a prior for influence derivation and introduces "claim trustworthiness", which fuses the trustworthiness of the source that provides the claim and the trustworthiness of its influencers.
An Effective and Efficient Truth Discovery Framework over Data Streams
TLDR
A novel framework is proposed to conduct truth discovery over streams; it incorporates various iterative methods to effectively estimate the source weights and adaptively decides the frequency of source weight computation, together with a novel scheme called adaptive source reliability assessment (ASRA), which converts an estimation problem into an optimization problem.
Modeling Truth Existence in Truth Discovery
TLDR
This work proposes a probabilistic graphical model, which simultaneously infers truth as well as source quality without any a priori training involving ground truth answers, and proposes an initialization scheme based upon a quantity named truth existence score, which synthesizes two indicators, namely, participation rate and consistency rate.

References

Semi-supervised truth discovery
TLDR
This paper proposes a semi-supervised approach that finds true values with the help of ground truth data, derives the optimal solution to the problem, and provides an iterative algorithm that converges to it.
Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources
TLDR
A probabilistic model to compute a probability distribution for the extracted values, and the accuracy of the sources, is developed, which considers the presence of sources that copy their contents from other sources, and manages the misleading consensus produced by copiers.
Integrating Conflicting Data: The Role of Source Dependence
TLDR
This paper applies Bayesian analysis to decide dependence between sources, designs an algorithm that iteratively detects dependence and discovers truth from conflicting information, and extends the model by considering the accuracy of data sources and the similarity between values.
CoBayes: bayesian knowledge corroboration with assessors of unknown areas of expertise
TLDR
This work proposes a joint probabilistic model of the truth values of statements and the expertise of users for assessing statements, and demonstrates the viability of CoBayes in comparison to other approaches, on real-world datasets and user feedback collected from Amazon Mechanical Turk.
Truth Discovery and Copying Detection in a Dynamic World
TLDR
A Hidden Markov Model is developed that decides whether a source is a copier of another source and identifies the specific moments at which it copies, along with a Bayesian model that aggregates information from the sources to decide the true value for a data item and the evolution of the true values over time.
Corroborating information from disagreeing views
TLDR
It is believed that corroboration can serve in a wide range of applications such as source selection in the semantic Web, data quality assessment or semantic annotation cleaning in social networks, and this work sets the bases for a wide range of techniques for solving these more complex problems.
Using Probabilistic Information in Data Integration
TLDR
This paper addresses the problem of ordering accesses to multiple information sources, in order to maximize the likelihood of obtaining answers as early as possible, and describes a declarative formalism for specifying several kinds of probabilistic information.
Truth Discovery with Multiple Conflicting Information Providers on the Web
TLDR
This paper designs a general framework for the Veracity problem and invents an algorithm, called TRUTHFINDER, which utilizes the relationships between websites and their information, i.e., a website is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy websites.
Knowing What to Believe (when you already know something)
TLDR
This work introduces a framework for incorporating prior knowledge into any fact-finding algorithm, expressing both general "common-sense" reasoning and specific facts already known to the user as first-order logic and translating this into a tractable linear program.
SourceRank: relevance and trust assessment for deep web sources based on inter-source agreement
TLDR
The relevance evaluations show that SourceRank improves precision by 22-60% over the Google Base and the other baseline methods, and it is demonstrated that SourceRank tracks source corruption.