Corpus ID: 60831687

Maintaining an Online Bibliographical Database: The Problem of Data Quality

@inproceedings{Ley2006MaintainingAO,
  title={Maintaining an Online Bibliographical Database: The Problem of Data Quality},
  author={Michael Ley and Patrick Reuther},
  booktitle={EGC},
  year={2006}
}
CiteSeer and Google Scholar are huge digital libraries which provide access to (computer-)science publications. Both collections are operated like specialized search engines: they crawl the web with little human intervention and analyse the documents to classify them and to extract some metadata from the full texts. On the other hand, there are traditional bibliographic databases like INSPEC for engineering and PubMed for medicine. For the field of computer science the DBLP service evolved…

Methods for Extracting Meta-Information from bibliographic databases
TLDR
This work builds a statistical system for the language identification of personal names using the framework of the sociolinguistic analysis and shows that extension of a purely statistical model with the co-authors network boosts the system’s performance.
Visualization of association graphs for assisting the interpretation of classifications
TLDR
It is shown that high-quality co-author and terminological graphs can be very easily extracted from the PASCAL database, visualized and browsed, and it is found that special problems of person names can be managed using simple heuristics.
Integration and Warehousing of Social Metadata for Search and Assessment of Scientific Knowledge
TLDR
This paper discusses the opportunities and challenges of integration for the purpose of facilitating the discovery and evaluation of scientific knowledge, and presents a framework for integration and warehousing of both bibliographic and social scientific metadata.
Disambiguating publication venue titles using association rules
TLDR
The disambiguator is a supervised learning method that uses the authority file to train a classifier, whose generated model is a set of association rules to identify publication venues.
Discovering and Analyzing Scientific Communities using Conference Network
TLDR
The approach presented in this thesis combines different clustering algorithms for detecting overlapping scientific communities, based on conference publication data, and shows that the approach makes it possible to automatically produce a community structure close to the human-defined classification of conferences.
Towards structured representation of academic search results
TLDR
A novel method of representing academic search results with concise and informative topic maps, based on sequential prediction to automatically learn to build informative summaries from examples, and an interactive learning method for selecting the categories of Wikipedia relevant to a given domain.
Sieving publishing communities in DBLP
  • Christoph Schommer
  • Computer Science
  • 2008 Third International Conference on Digital Information Management
  • 2008
DBLP is a bibliographic database with more than one million data entries, collected from the last 70 years, and labeled with diverse attributes like the authors' names, the publication title, and…
Your Personal, Virtual Librarian
TLDR
Reports of original data should include an abstract of no more than 300 words using the following headings: Context, Objective, Design, Setting, Patients (or Participants), Interventions, Results, and Conclusions.
Automating Document Annotation Using Open Source Knowledge
TLDR
This paper uses crowd-sourced knowledge bases like Wikipedia and WikiCFP for automating key-phrase generation and develops a global-context-based key-phrase identification approach that generates its global context information using academic search engines like Google Scholar.
OntologyNavigator: WEB 2.0 scalable ontology based CLIR portal to IT scientific corpus for researchers
TLDR
The architecture used in the ongoing OntologyNavigator project is presented: a research tool that helps advanced learners find suitable IT papers for building scientific bibliographies, with a French translation of the ontology proposed automatically.

References

The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives
TLDR
The most time-consuming task for the maintainers of DBLP may be viewed as a special instance of the authority control problem: how to normalize different spellings of person names.
Browsing and visualizing digital bibliographic data
TLDR
An overview is given of some important research issues in bibliographical information retrieval and visualization within the DBLP (Digital Bibliography & Library Project) Computer Science Bibliography.
Cleaning the spurious links in data
TLDR
Comparing context information between data records can help solve the data quality problem of spurious links, that is, multiple links between data entries and real-world entities.
Comparative study of name disambiguation problem using a scalable blocking-based framework
TLDR
This study identifies combinations that are scalable and effective to disambiguate author names in citations based on a scalable two-step framework and presents extensive experimental results.
A hierarchical naive Bayes mixture model for name disambiguation in author citations
TLDR
This paper presents a hierarchical naive Bayes mixture model, an unsupervised learning approach, for name disambiguation in author citations, which partitions a collection of citations into clusters, with each cluster containing only citations authored by the same author, thus disambiguating authorship in citations to induce author name identities.
Co-authorship networks in the digital library research community
TLDR
The state of the DL domain after a decade of activity is examined by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences, and clear advantages of PageRank and AuthorRank are shown over degree, closeness and betweenness centrality metrics.
On six degrees of separation in DBLP-DB and more
An extensive bibliometric study on the db community using the collaboration network constructed from DBLP data is presented. Among many, we have found that (1) the average distance of all db scholars…
Social Networks Applied
TLDR
The authors investigate the following areas concerning social networks: how to exploit their unprecedented wealth of data and how to mine social networks for purposes such as marketing campaigns; social networks as a particular form of influence; the way that people agree on terminology and this phenomenon's implications for the way the authors build ontologies and the Semantic Web.
Adaptive Name Matching in Information Integration
TLDR
The authors compare and describe methods for combining and learning textual similarity measures for name matching that are essential for information integration.
Data quality for the information age
TLDR
This comprehensive book provides business leaders, process owners, and information professionals with the background and methods necessary to set up a data quality program, make and sustain order of magnitude improvements, and create a unique and important business advantage.