Rule-based word clustering for document metadata extraction

@inproceedings{Han2005RulebasedWC,
  title={Rule-based word clustering for document metadata extraction},
  author={Hui Han and Eren Manavoglu and Hongyuan Zha and Kostas Tsioutsiouliklis and C. Lee Giles and Xiangmin Zhang},
  booktitle={SAC '05},
  year={2005}
}
Text classification is still an important problem for unlabeled text; CiteSeer, a computer science document search engine, uses automatic text classification methods for document indexing. Text classification uses a document's original text words as the primary feature representation. However, such representation usually comes with high dimensionality and feature sparseness. Word clustering is an effective approach to reduce feature dimensionality and feature sparseness, and improve text… 
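As a rough illustration of how word clustering can shrink a bag-of-words feature space, the sketch below maps raw tokens to shared cluster labels before counting them. The rule set, the labels such as `:email:`, and the function names `cluster_word` and `featurize` are hypothetical and chosen only for this example; they are not the rules or the implementation described in the paper.

```python
import re
from collections import Counter

# Hypothetical rules (assumptions for illustration, not taken from the paper):
# each rule maps a regular expression to a cluster label. Order matters,
# because the first matching rule wins.
RULES = [
    (re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"), ":email:"),
    (re.compile(r"^(19|20)\d{2}$"), ":year:"),
    (re.compile(r"^\d+$"), ":number:"),
    (re.compile(r"^(january|february|march|april|may|june|july|august|"
                r"september|october|november|december)$"), ":month:"),
]


def cluster_word(word):
    """Return the cluster label for a word, or the lowercased word itself."""
    token = word.lower()
    for pattern, label in RULES:
        if pattern.match(token):
            return label
    return token


def featurize(text):
    """Count clustered tokens in a whitespace-tokenized text."""
    return Counter(cluster_word(w) for w in text.split())


sample = "Contact alice@example.org or bob@example.com in 2005 and 1999 page 42"
raw_features = Counter(w.lower() for w in sample.split())
clustered_features = featurize(sample)
```

In this toy sample, the two e-mail addresses and the two years each collapse into a single feature, so the clustered representation has fewer distinct features than the raw one. Real systems would need a proper tokenizer and a domain-specific rule set.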

Document features selection using background knowledge and word clustering technique
TLDR
The simulation results of the proposed method show that the document dimensions are decreased effectively and, consequently, the performance of document clustering is increased.
Document Classification in Support of Automated Metadata Extraction from Heterogeneous Collections.
TLDR
This dissertation examines the evolution and all the major components of an automated metadata extraction system and investigates alternative methods of document classification to replace or supplement post hoc classification.
Word Clustering Based on Un-LP Algorithm
TLDR
An unsupervised label propagation algorithm (Un-LP) for word clustering is proposed, which uses multi-exemplars to represent a cluster; experiments on a synthetic 2D dataset show the strong self-correcting ability of the proposed algorithm.
A DOCUMENT ENGINEERING APPROACH TO AUTOMATIC EXTRACTION OF SHALLOW METADATA FROM SCIENTIFIC PUBLICATIONS
TLDR
A solution for automatic metadata extraction from scientific publications published as PDF documents is described, combining mining and analysis of the publications’ text based on its formatting style and font information.
A Rule-Based Framework of Metadata Extraction from Scientific Papers
  • Zhixin Guo, Hai Jin
  • Computer Science
    2011 10th International Symposium on Distributed Computing and Applications to Business, Engineering and Science
  • 2011
TLDR
A framework for automatic metadata extraction from scientific papers is described, based on a spatial and visual knowledge principle, which can extract title, authors and abstract from science papers.
A comparison of layout based bibliographic metadata extraction techniques
TLDR
This paper compares style and content features on existing state-of-the-art methods on two newly created real-world data sets for metadata extraction and shows that two-stage SVMs provide reasonable performance in solving the challenge of metadata extraction for crowdsourcing bibliographic metadata management.
Vision and natural language for metadata extraction from scientific PDF documents: a multimodal approach
TLDR
A multimodal neural network model that employs NLP together with computer vision (CV) for metadata extraction from scientific PDF documents is proposed to benefit from both modalities and increase the overall accuracy of metadata extraction.
Metadata extraction with cue model
TLDR
This paper proposes a new technique to extract metadata of documents, called Metadata Extraction with Cue Model, which uses combinations of a few features to extract metadata automatically from documents.
Extraction from a Bibliographic Database
TLDR
A rule-based information extraction process, applied to selected data extracted from a bibliographic database of published R&D papers, is proposed in this paper to build up a database of relevant concepts, clean the retrieved data, and automate the process of information retrieval in the local database.
Multi-View Meets Average Linkage: Exploring the Role of Metadata in Document Clustering
TLDR
This paper embeds the idea of the Multi-Viewpoint Based Similarity Measure for Clustering (MVSC) into a hierarchical clustering method, i.e., average linkage clustering, to overcome the problem of initiation with random seeds, resulting in a new algorithm, referred to as MVSC-HAC.

References

SHOWING 1-10 OF 19 REFERENCES
Distributional clustering of words for text classification
TLDR
This paper describes the application of Distributional Clustering to document classification and shows that it can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy, significantly better than Latent Semantic Indexing, class-based clustering, feature selection by mutual information, or Markov-blanket-based feature selection.
The Power of Word Clusters for Text Classification
TLDR
This work applies the information bottleneck method to find word clusters that preserve the information about document categories and uses these clusters as features for classification, and shows that when the training sample is small, word clusters can yield significant improvement in classification accuracy.
A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification
TLDR
A new information-theoretic divisive algorithm for feature/word clustering is proposed and applied to text classification, and it is shown that feature clustering is an effective technique for building smaller class models in hierarchical classification.
Automatic document metadata extraction using support vector machines
TLDR
It is found that discovery and use of the structural patterns of the data and domain based word clustering can improve the metadata extraction performance and an appropriate feature normalization also greatly improves the classification performance.
A Comparative Study on Feature Selection in Text Categorization
TLDR
This paper finds strong correlations between the DF, IG, and CHI values of a term and suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive.
Bibliographic attribute extraction from erroneous references based on a statistical model
  • A. Takasu
  • Computer Science
    2003 Joint Conference on Digital Libraries, 2003. Proceedings.
  • 2003
TLDR
A statistical model for attribute extraction that represents both the syntactical structure of references and OCR error patterns is proposed and it is shown that the proposed model has advantages in reducing the cost of preparing training data.
Document clustering with committees
TLDR
A new evaluation methodology that is based on the editing distance between output clusters and manually constructed classes (the answer key) is presented, which is more intuitive and easier to interpret than previous evaluation measures.
Mining the peanut gallery: opinion extraction and semantic classification of product reviews
TLDR
This work develops a method for automatically distinguishing between positive and negative reviews and draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation.
Learning Hidden Markov Model Structure for Information Extraction
TLDR
It is demonstrated that a manually-constructed model that contains multiple states per extraction field outperforms a model with one state per field, and the use of distantly-labeled data to set model parameters provides a significant improvement in extraction accuracy.
Improving category specific Web search by learning query modifications
TLDR
An automated method for learning query modifications that can dramatically improve precision for locating pages within specified categories using Web search engines and a classification procedure that can recognize pages in a specific category with high precision is presented.