Corpus ID: 235694574

Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations

@article{Choudhury2021AutomaticME,
  title={Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations},
  author={Muntabir Hasan Choudhury and Himarsha R. Jayanetti and Jian Wu and William A. Ingram and Edward A. Fox},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.00516}
}
Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important to build scalable digital library search engines. Most existing methods are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as for ETDs. Traditional sequence tagging methods mainly rely on text-based features. In…


References

Showing 1-10 of 14 references
Automatic document metadata extraction using support vector machines
It is found that discovery and use of the structural patterns of the data and domain-based word clustering can improve metadata extraction performance, and that appropriate feature normalization also greatly improves classification performance.
CERMINE: automatic extraction of structured metadata from scientific literature
The overall workflow architecture of CERMINE is outlined, details of the individual steps' implementations are provided, and an evaluation of the extraction workflow on a large dataset showed good performance for most metadata types.
Evaluation of header metadata extraction approaches and tools for scientific PDF documents
In an evaluation using papers from the arXiv collection, GROBID delivered the best results, followed by Mendeley Desktop; SciPlore Xtract, PDFMeat, and SVMHeaderParse also delivered good results depending on the metadata type to be extracted.
HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities
This work investigates a variant of automatic keyphrase extraction from scientific documents, defined as Scientific Domain Knowledge Entity (SDKE) extraction, and suggests that the accuracy of sequential learners can be improved by utilizing the predictions of a non-sequential model.
Keyphrase Extraction using Sequential Labeling
A basic set of features commonly used in NLP tasks, as well as predictions from various unsupervised methods, are explored to train the taggers; the tagging models are shown to yield significant performance benefits over existing state-of-the-art extraction methods.
GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications
Based on state-of-the-art machine learning techniques, GROBID (GeneRation Of BIbliographic Data) performs reliable bibliographic data extraction from scholarly articles, combined with multi-level term extraction.
Aligning Ground Truth Text with OCR Degraded Text
This paper proposes an alignment algorithm that, when tested with the TREC-5 data set, achieves an initial alignment accuracy averaging 98.547% without zoning problems and 81.07% with them.
ParsCit: an Open-source CRF Reference String Parsing Package
ParsCit is described: a freely available, open-source implementation of a reference string parsing package that wraps a trained conditional random field model with added functionality to identify reference strings in a plain text file and to retrieve citation contexts.
Edlib: a C/C++ library for fast, exact sequence alignment using edit distance
Edlib is presented, an open-source C/C++ library for exact pairwise sequence alignment using edit distance, expected to be easily adopted as a building block for future bioinformatics tools.
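The edit distance that Edlib computes can be sketched with a minimal dynamic-programming implementation. This is an illustrative sketch only, not Edlib's actual algorithm (the library uses Myers' bit-vector method for speed):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein edit distance between strings a and b.

    Classic O(len(a) * len(b)) dynamic programming, kept to a
    single rolling row of the DP table for O(len(b)) memory.
    """
    # prev[j] = distance between a[:i-1] and b[:j] from the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute (or match) ca -> cb
            ))
        prev = curr
    return prev[len(b)]

print(edit_distance("kitten", "sitting"))  # -> 3
```

In the OCR-alignment setting of the paper above, a distance like this quantifies how far an extracted text span is from its ground-truth counterpart.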
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.