Rule-based word clustering for document metadata extraction
@inproceedings{Han2005RulebasedWC, title={Rule-based word clustering for document metadata extraction}, author={Hui Han and Eren Manavoglu and Hongyuan Zha and Kostas Tsioutsiouliklis and C. Lee Giles and Xiangmin Zhang}, booktitle={SAC '05}, year={2005} }
Text classification is still an important problem for unlabeled text; CiteSeer, a computer science document search engine, uses automatic text classification methods for document indexing. Text classification uses a document's original text words as the primary feature representation. However, such representation usually comes with high dimensionality and feature sparseness. Word clustering is an effective approach to reduce feature dimensionality and feature sparseness, and improve text…
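The dimensionality argument in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the rules, word lists, cluster labels, and function names below are hypothetical, chosen only to show how mapping words to shared cluster labels shrinks a bag-of-words vocabulary.

```python
import re
from collections import Counter

# Hypothetical pattern rules (illustrative only, not taken from the paper).
# Each rule maps tokens matching a regex to one shared cluster label.
RULES = [
    (re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"), ":email:"),
    (re.compile(r"^\d{4}$"), ":year:"),
    (re.compile(r"^\d+$"), ":number:"),
]

# Hypothetical domain word lists standing in for real dictionaries.
WORD_LISTS = {
    ":affiliation:": {"university", "department", "institute", "laboratory"},
}


def cluster_of(token):
    """Return the cluster label for a token, or the lowercased token itself."""
    for pattern, label in RULES:
        if pattern.match(token):
            return label
    lowered = token.lower()
    for label, words in WORD_LISTS.items():
        if lowered in words:
            return label
    return lowered


def cluster_features(text):
    """Bag-of-clusters representation of a whitespace-tokenized line."""
    return Counter(cluster_of(tok) for tok in text.split())


lines = [
    "Department of Physics University of Oslo 1999",
    "Institute of Chemistry University of Lima 2004",
]

# Vocabulary size before and after clustering.
raw_vocab = {tok.lower() for line in lines for tok in line.split()}
clustered_vocab = {label for line in lines for label in cluster_features(line)}
print(len(raw_vocab), len(clustered_vocab))
```

In this toy input, distinct years and affiliation words collapse into single features, so the clustered vocabulary is smaller than the raw one and the per-feature counts are less sparse.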
37 Citations
Document features selection using background knowledge and word clustering technique
- Computer Science
- 2014
Simulation results for the proposed method show that document dimensionality is reduced effectively and, consequently, that document clustering performance improves.
Document Classification in Support of Automated Metadata Extraction Form Heterogeneous Collections.
- Computer Science
- 2014
This dissertation examines the evolution and all the major components of an automated metadata extraction system and investigates alternative methods of document classification to replace or supplement post hoc classification.
Word Clustering Based on Un-LP Algorithm
- Computer Science
- 2014
An unsupervised label propagation algorithm (Un-LP) for word clustering is proposed, which uses multi-exemplars to represent a cluster; experiments on a synthetic 2D dataset show the algorithm's strong self-correcting ability.
A DOCUMENT ENGINEERING APPROACH TO AUTOMATIC EXTRACTION OF SHALLOW METADATA FROM SCIENTIFIC PUBLICATIONS
- Computer Science
- 2009
A solution for automatic metadata extraction from scientific publications published as PDF documents is described, which combines mining and analysis of the publications’ text based on its formatting style and font information.
A Rule-Based Framework of Metadata Extraction from Scientific Papers
- Computer Science
- 2011 10th International Symposium on Distributed Computing and Applications to Business, Engineering and Science
- 2011
A framework for automatic metadata extraction from scientific papers is described, based on a spatial and visual knowledge principle, which can extract title, authors and abstract from science papers.
A comparison of layout based bibliographic metadata extraction techniques
- Computer Science
- WIMS '12
- 2012
This paper compares style and content features on existing state-of-the-art methods on two newly created real-world data sets for metadata extraction and shows that two-stage SVMs provide reasonable performance in solving the challenge of metadata extraction for crowdsourcing bibliographic metadata management.
Vision and natural language for metadata extraction from scientific PDF documents: a multimodal approach
- Computer Science
- JCDL
- 2022
A multimodal neural network model that combines NLP with computer vision (CV) for metadata extraction from scientific PDF documents is proposed, drawing on both modalities to increase the overall accuracy of metadata extraction.
Metadata extraction with cue model
- Computer Science
- 2008
This paper proposes a new technique to extract metadata of documents, called Metadata Extraction with Cue Model, which uses combinations of a few features to extract metadata automatically from documents.
Extraction from a Bibliographic Database
- Computer Science
- 2018
This paper proposes a rule-based information extraction process, applied to selected data extracted from a bibliographic database of published R&D papers, to build a database of relevant concepts, clean the retrieved data, and automate information retrieval in the local database.
Multi-View Meets Average Linkage: Exploring the Role of Metadata in Document Clustering
- Computer Science
- Int. J. Inf. Retr. Res.
- 2015
This paper embeds the idea of the Multi-Viewpoint Based Similarity Measure for clustering (MVSC) into a hierarchical clustering method, namely average linkage clustering, to overcome the problem of initialization with random seeds, resulting in a new algorithm referred to as MVSC-HAC.
References
Distributional clustering of words for text classification
- Computer Science
- SIGIR '98
- 1998
This paper describes the application of Distributional Clustering to document classification and shows that it can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy, significantly better than Latent Semantic Indexing, class-based clustering, feature selection by mutual information, or Markov-blanket-based feature selection.
The Power of Word Clusters for Text Classification
- Computer Science
- 2006
This work applies the information bottleneck method to find word clusters that preserve the information about document categories and uses these clusters as features for classification, and shows that when the training sample is small, word clusters can yield significant improvement in classification accuracy.
A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification
- Computer Science
- J. Mach. Learn. Res.
- 2003
A new information-theoretic divisive algorithm for feature/word clustering is proposed and applied to text classification, and it is shown that feature clustering is an effective technique for building smaller class models in hierarchical classification.
Automatic document metadata extraction using support vector machines
- Computer Science
- 2003 Joint Conference on Digital Libraries, 2003. Proceedings.
- 2003
It is found that discovery and use of the structural patterns of the data and domain based word clustering can improve the metadata extraction performance and an appropriate feature normalization also greatly improves the classification performance.
A Comparative Study on Feature Selection in Text Categorization
- Computer Science
- ICML
- 1997
This paper finds strong correlations between the DF, IG, and CHI values of a term and suggests that DF thresholding, the simplest method with the lowest computational cost, can be reliably used instead of IG or CHI when computing these measures is too expensive.
Bibliographic attribute extraction from erroneous references based on a statistical model
- Computer Science
- 2003 Joint Conference on Digital Libraries, 2003. Proceedings.
- 2003
A statistical model for attribute extraction that represents both the syntactical structure of references and OCR error patterns is proposed and it is shown that the proposed model has advantages in reducing the cost of preparing training data.
Document clustering with committees
- Computer Science
- SIGIR '02
- 2002
A new evaluation methodology that is based on the editing distance between output clusters and manually constructed classes (the answer key) is presented, which is more intuitive and easier to interpret than previous evaluation measures.
Mining the peanut gallery: opinion extraction and semantic classification of product reviews
- Computer Science
- WWW '03
- 2003
This work develops a method for automatically distinguishing between positive and negative reviews and draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation.
Learning Hidden Markov Model Structure for Information Extraction
- Computer Science
- 1999
It is demonstrated that a manually-constructed model that contains multiple states per extraction field outperforms a model with one state per field, and the use of distantly-labeled data to set model parameters provides a significant improvement in extraction accuracy.
Improving category specific Web search by learning query modifications
- Computer Science
- Proceedings 2001 Symposium on Applications and the Internet
- 2001
An automated method for learning query modifications that can dramatically improve precision for locating pages within specified categories using Web search engines, and a classification procedure that can recognize pages in a specific category with high precision, are presented.