• Corpus ID: 60831687

Maintaining an Online Bibliographical Database: The Problem of Data Quality

Michael Ley, Patrick Reuther
CiteSeer and Google Scholar are huge digital libraries which provide access to (computer-)science publications. Both collections are operated like specialized search engines: they crawl the web with little human intervention and analyse the documents to classify them and to extract some metadata from the full texts. On the other hand, there are traditional bibliographic databases like INSPEC for engineering and PubMed for medicine. For the field of computer science the DBLP service evolved…
Methods for Extracting Meta-Information from bibliographic databases
This work builds a statistical system for identifying the language of personal names within the framework of sociolinguistic analysis and shows that extending a purely statistical model with the co-author network boosts the system’s performance.
Visualization of association graphs for assisting the interpretation of classifications
It is shown that high-quality co-author and terminological graphs can be very easily extracted from the PASCAL database, visualized, and browsed, and it is found that special problems of person names can be managed using simple heuristics.
Integration and Warehousing of Social Metadata for Search and Assessment of Scientific Knowledge
This paper discusses the opportunities and challenges of integration for the purpose of facilitating the discovery and evaluation of scientific knowledge, and presents a framework for integration and warehousing of both bibliographic and social scientific metadata.
Disambiguating publication venue titles using association rules
The disambiguator is a supervised learning method that uses the authority file to train a classifier, whose generated model is a set of association rules to identify publication venues.
Discovering and Analyzing Scientific Communities using Conference Network
The approach presented in this thesis combines different clustering algorithms for detecting overlapping scientific communities, based on conference publication data, and shows that the approach makes it possible to automatically produce a community structure close to the human-defined classification of conferences.
Towards structured representation of academic search results
A novel method of representing academic search results with concise and informative topic maps, based on sequential prediction to automatically learn to build informative summaries from examples, and an interactive learning method for selecting the categories of Wikipedia relevant to a given domain.
Sieving publishing communities in DBLP
Christoph Schommer. 2008 Third International Conference on Digital Information Management, 2008.
DBLP is a bibliographic database with more than one million data entries, collected from the last 70 years, and labeled with diverse attributes like the authors' names, the publication title, and
Your Personal, Virtual Librarian
Reports of original data should include an abstract of no more than 300 words using the following headings: Context, Objective, Design, Setting, Patients (or Participants), Interventions, Results, and Conclusions.
Automating Document Annotation Using Open Source Knowledge
This paper has used crowd-source knowledge bases like Wikipedia and WikiCFP for automating key phrase generation and developed a global context based key-phrase identification approach that generates its global context information using academic search engines like Google Scholar.
OntologyNavigator: WEB 2.0 scalable ontology based CLIR portal to IT scientific corpus for researchers
The architecture of the ongoing OntologyNavigator project is presented: a research tool that helps advanced learners find adapted IT papers for creating scientific bibliographies, with an ontology translation in French proposed automatically.


The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives
The most time-consuming task for the maintainers of DBLP may be viewed as a special instance of the authority control problem: how to normalize different spellings of person names.
Browsing and visualizing digital bibliographic data
An overview of some important research issues within the field of bibliographical information retrieval and visualization within the DBLP (Digital Bibliography & Library Project) Computer Science Bibliography is given.
Cleaning the spurious links in data
Comparing context information between data records can help solve the data quality problem of spurious links, that is, multiple links between data entries and real-world entities.
Comparative study of name disambiguation problem using a scalable blocking-based framework
This study identifies combinations that are scalable and effective to disambiguate author names in citations based on a scalable two-step framework and presents extensive experimental results.
A hierarchical naive Bayes mixture model for name disambiguation in author citations
This paper presents a hierarchical naive Bayes mixture model, an unsupervised learning approach, for name disambiguation in author citations, which partitions a collection of citations into clusters, with each cluster containing only citations authored by the same author, thus disambiguating authorship in citations to induce author name identities.
Co-authorship networks in the digital library research community
On six degrees of separation in DBLP-DB and more
An extensive bibliometric study on the db community using the collaboration network constructed from DBLP data is presented. Among many, we have found that (1) the average distance of all db scholars
Social Networks Applied
The authors investigate the following areas concerning social networks: how to exploit their unprecedented wealth of data and how to mine social networks for purposes such as marketing campaigns; social networks as a particular form of influence; the way that people agree on terminology and this phenomenon's implications for the way the authors build ontologies and the Semantic Web.
Adaptive Name Matching in Information Integration
The authors compare and describe methods for combining and learning textual similarity measures for name matching that are essential for information integration.
Data quality for the information age
This comprehensive book provides business leaders, process owners, and information professionals with the background and methods necessary to set up a data quality program, make and sustain order of magnitude improvements, and create a unique and important business advantage.