A pipeline for extracting and deduplicating domain-specific knowledge bases

@inproceedings{Kejriwal2015APF,
  title={A pipeline for extracting and deduplicating domain-specific knowledge bases},
  author={Mayank Kejriwal and Qiaoling Liu and Ferosh Jacob and Faizan Javed},
  booktitle={2015 IEEE International Conference on Big Data (Big Data)},
  year={2015},
  pages={1144--1153}
}
Building a knowledge base (KB) describing domain-specific entities is an important problem in industry, examples including KBs built over companies (e.g. Dun & Bradstreet), skills (LinkedIn, CareerBuilder) and people (inome). The task involves several engineering challenges, including devising effective procedures for data extraction, aggregation and deduplication. Data extraction involves processing multiple information sources in order to extract domain-specific data instances. The extracted…

Key Quantitative Results

  • Using an independent authoritative list of public companies, we show that the extracted KB achieves coverage of almost 60%, even by pessimistic estimates.
  • The AC blocking method is found to achieve an F-score above 90% on all three datasets, while token-based features and a random forest classifier are found to work best for the matching step.
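
The two deduplication stages named above (blocking, then matching on token-based features) can be illustrated with a minimal sketch. It rests on several assumptions: plain token blocking stands in for the AC blocking method, which this listing does not define; a single Jaccard score stands in for the paper's feature set; and no classifier is trained. The function names and the sample company records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(name):
    """Lowercase a name and split it into a set of alphanumeric tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in name)
    return set(cleaned.split())


def token_blocks(records):
    """Blocking step (assumed variant): group record ids by shared token."""
    blocks = defaultdict(set)
    for rec_id, name in records.items():
        for tok in tokens(name):
            blocks[tok].add(rec_id)
    return blocks


def candidate_pairs(records):
    """Return the pairs of record ids that fall into at least one common block."""
    pairs = set()
    for ids in token_blocks(records).values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Token-based similarity feature: |A & B| / |A | B| over name tokens."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Hypothetical sample records, not taken from the paper's datasets.
records = {
    1: "Acme Corp.",
    2: "ACME Corporation",
    3: "Globex Inc",
}

pairs = candidate_pairs(records)
features = {
    pair: jaccard(records[pair[0]], records[pair[1]]) for pair in sorted(pairs)
}
```

Only pairs that survive blocking are scored, which is what keeps the matching step from comparing every record against every other. In a fuller pipeline the scores in `features` would be one of several inputs to a trained classifier such as the random forest mentioned above.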
