In recent years, mining imbalanced data sets has received increasing attention in both theoretical and practical aspects. This paper introduces the importance of imbalanced data sets and their broad application domains in data mining, and then summarizes the evaluation metrics and the existing methods to evaluate and solve the imbalance problem.
Due to name abbreviations, identical names, name misspellings, and pseudonyms in publications or bibliographies (citations), an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, web search, and database integration, and may cause improper attribution to authors. This …
An author may have multiple names and multiple authors may share the same name simply due to name abbreviations, identical names, or name misspellings in publications or bibliographies. This can produce name ambiguity, which can affect the performance of document retrieval, web search, and database integration, and may cause improper attribution …
Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a Support Vector Machine classification-based method for metadata extraction from the header part of research papers and show that it outperforms other …
Because of name variations, an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, web search, and database integration, and may cause improper attribution to authors. This paper presents a hierarchical naive Bayes mixture model, an unsupervised learning approach, for …
Acknowledgements in research publications, like citations, indicate influential contributions to scientific work; however, large-scale acknowledgement analyses have traditionally been impractical due to the high cost of manual information extraction. In this paper we describe a mixture method for automatically mining acknowledgements from research documents …
Quantitative susceptibility mapping (QSM) is a novel MRI method for quantifying tissue magnetic property. In the brain, it reflects the molecular composition and microstructure of the local tissue. However, susceptibility maps reconstructed from single-orientation data still suffer from streaking artifacts which obscure structural details and small lesions.
Mortality from liver cancer in humans is increasingly attributable to heavy or long-term alcohol consumption. The mechanisms by which alcohol exerts its carcinogenic effect are not well understood. In this study, the role of alcohol-induced endoplasmic reticulum (ER) stress response in liver cancer development was investigated using an animal model with a …
CiteSeer is currently a very large source of meta-data information on the World Wide Web (WWW). This meta-data is the key material for the Semantic Web. Still, CiteSeer is not yet a Semantic-enabled service and therefore its meta-data, although potentially usable by Semantic Web agents, is not yet reachable using the Semantic Web mechanisms. The complexity …
Text classification is still an important problem for unlabeled text; CiteSeer, a computer science document search engine, uses automatic text classification methods for document indexing. Text classification uses a document's original text words as the primary feature representation. However, such a representation usually comes with high dimensionality and …