• Corpus ID: 562945

Automatic Identification of Research Articles from Crawled Documents

@inproceedings{Caragea2014AutomaticIO,
  title={Automatic Identification of Research Articles from Crawled Documents},
  author={Cornelia Caragea and Jian Wu and Kyle Williams and Sujatha G. Das and Madian Khabsa and Pradeep B. Teregowda and C. Lee Giles},
  booktitle={WSDM 2014},
  year={2014}
}
Paper from the Web-Scale Classification: Classifying Big Data from the Web Workshop. This paper proposes novel features that result in effective and efficient classification models for automatic identification of research articles. 

Big Scholarly Data in CiteSeerX: Information Extraction from the Web
We examine CiteSeerX, an intelligent system designed with the goal of automatically acquiring and organizing large-scale collections of scholarly documents from the world wide web. From the
Information Extraction for Scholarly Document Big Data
TLDR
Key extraction technologies used in CiteSeerX are presented, including document classification and de-duplication, document clustering, header/citation extraction, author disambiguation, and table/algorithm extraction.
Document Type Classification in Online Digital Libraries
TLDR
This work proposes novel features that result in high-accuracy classifiers for document type classification and shows that these classifiers outperform models that are employed in current systems.
CiteSeerX : Intelligent Information Extraction and Knowledge Creation from Web-Based Data
TLDR
CiteSeerX provides free access to over 4 million full-text academic documents and rarely seen functionalities, e.g., table search, which has been used for many data mining projects.
Automated Identification of Computer Science Research Papers
TLDR
With a large training set, bigram modeling with normalized feature weights performs best on both data sets; surprisingly, the arXiv data set can be classified with an F1 of up to 0.95, while CiteSeerX reaches a lower F1 (0.764).
An Empirical Analysis of Big Scholarly Data to Find the Increase in Citations
The research quality and productivity of a research area are decided by the number of research articles and citations. Several factors affect the citation count of a research article. The objective
Document Analysis and Retrieval Tasks in Scientific Digital Libraries
TLDR
This tutorial focuses on open-access, scientific digital libraries such as CiteSeer, which involve several crawling, ranking, content analysis, and metadata extraction tasks, and elaborates on the challenges involved and how machine learning methods can successfully address these challenges.
Dynamic Classification in Web Archiving Collections
TLDR
This paper explores dynamic fusion models to find, on the fly, the model or combination of models that performs best on a variety of document types and shows that the approach that fuses different models outperforms individual models and other ensemble methods on three datasets.
Improving Researcher Homepage Classification with Unlabeled Data
TLDR
It is shown that although several algorithms are effective in using information from the unlabeled instances, co-training that explicitly harnesses the feature split in the underlying instances outperforms approaches that combine content and URL features into a single view.
An amoeboid approach for identifying optimal citation flow in big scholarly data network
  • Nivash J P, D. L D
  • Computer Science, International Journal of Communication Systems
  • 2018
TLDR
An amoeboid approach, article‐optimal citation flow (A‐OCF), is used to find the optimal flow of citations in the big scholarly data network, and a novel modern metric for article quality (MMAQ) is proposed to identify the quality of articles.

References

SHOWING 1-10 OF 21 REFERENCES
Web search using automatic classification
TLDR
The research indicates that Web classification and search tools must compensate for artifices such as Web spamming that have resulted from the very existence of such tools.
Web page classification: Features and algorithms
TLDR
As work in Web page classification is reviewed, the importance of these Web-specific features and algorithms is noted, state-of-the-art practices are described, and the underlying assumptions behind the use of information from neighboring pages are tracked.
The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists
TLDR
It is found that crawling the whitelist significantly increases the crawl precision by eliminating a large number of irrelevant requests and downloads.
Fast webpage classification using URL features
TLDR
This work demonstrates the usefulness of the uniform resource locator (URL) alone in performing web page classification and shows that in certain scenarios, URL-based methods approach the performance of current state-of-the-art full-text and link-based methods.
Classifying Scientific Publications Using Abstract Features
TLDR
This paper compares feature abstraction with two other methods for dimensionality reduction, i.e., feature selection and Latent Dirichlet Allocation (LDA), and proposes an approach to automatic identification of a cut in order to trade off the complexity of classifiers against their performance.
Web-page classification through summarization
TLDR
This paper gives empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms and proposes a new Web summarization-based classification algorithm that achieves an approximately 8.8% improvement over pure-text-based methods.
ArnetMiner: extraction and mining of academic social networks
TLDR
The architecture and main features of the ArnetMiner system, which aims at extracting and mining academic social networks, are described and a unified modeling approach to simultaneously model topical aspects of papers, authors, and publication venues is proposed.
Researcher homepage classification using unlabeled data
TLDR
It is demonstrated that tuning the classifiers so that they make "similar" predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms, and argued that this loss formulation provides insight into understanding the co-training process and can be used even in the absence of a validation set.
CiteSeer: an automatic citation indexing system
TLDR
CiteSeer has many advantages over traditional citation indexes, including the ability to create more up-to-date databases which are not limited to a preselected set of journals or restricted by journal publication delays, completely autonomous operation with a corresponding reduction in cost, and powerful interactive browsing of the literature using the context of citations.
Similar researcher search in academic environments
TLDR
To the best of the authors' knowledge, this work is the first to address content-based researcher recommendation in an academic setting and demonstrates it for Computer Science via the system ScholarSearch.