S2ORC: The Semantic Scholar Open Research Corpus

@inproceedings{Lo2020S2ORCTS,
  title={S2ORC: The Semantic Scholar Open Research Corpus},
  author={Kyle Lo and Lucy Lu Wang and Mark Neumann and Rodney Michael Kinney and Daniel S. Weld},
  booktitle={ACL},
  year={2020}
}
We introduce S2ORC, a large corpus of 81.1M English-language academic papers spanning many academic disciplines. The corpus consists of rich metadata, paper abstracts, resolved bibliographic references, as well as structured full text for 8.1M open access papers. Full text is annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. In S2ORC, we aggregate papers from hundreds of academic publishers and digital…
Capturing Relations between Scientific Papers: An Abstractive Model for Related Work Section Generation
  • Xiuying Chen, Hind Alamro, +4 authors Rui Yan
  • Computer Science
  • ACL/IJCNLP
  • 2021
Given a set of related publications, related work section generation aims to provide researchers with an overview of the specific research area by summarizing these works and introducing them in a…
Importance Assessment in Scholarly Networks
TLDR
The proposed metric, denoted by Content Informed Index (CII), uses the content of the paper as a source of distant-supervision, to weight the edges of a citation network to derive impact metrics for the various entities involved, like the publications, authors etc.
Citation Intent Classification Using Word Embedding
TLDR
This study critically investigated the available datasets for citation intent and proposed an automated citation intent technique to label the citation context with citation intent, which will enhance the study of citation context analysis.
CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding
TLDR
This work introduces CITEWORTH, a large, contextualized, rigorously cleaned labelled dataset for cite-worthiness detection built from a massive corpus of extracted plaintext scientific documents, and shows that CITEWORTH is high-quality, challenging, and suitable for studying problems such as domain adaptation.
Enhancing Scientific Papers Summarization with Citation Graph
TLDR
This paper proposes a citation graph-based summarization model (CGSUM) which can incorporate the information of both the source paper and its references and constructs a novel scientific papers summarization dataset Semantic Scholar Network (SSN), which constitutes a large connected citation graph.
ParaSCI: A Large Scientific Paraphrase Dataset for Longer Paraphrase Generation
TLDR
The first large-scale paraphrase dataset in the scientific field, including 33,981 paraphrase pairs from ACL and arXiv is proposed, and PDBERT is put up as a general paraphrase discovering method to take advantage of sentences paraphrased partially.
Recovering Lexically and Semantically Reused Texts
TLDR
This work sheds light on a number of previously unexplored questions in the study of LTRD, including the importance of incorporating document-level context for predictions, the applicability of off-the-shelf neural models pretrained on “general” semantic textual similarity tasks such as paraphrase detection, and the trade-offs between more efficient bag-of-words and feature-based neural models and slower pairwise neural models.
SChuBERT: Scholarly Document Chunks with BERT-encoding boost Citation Count Prediction.
TLDR
This work uses the open access ACL Anthology collection in combination with the Semantic Scholar bibliometric database to create a large corpus of scholarly documents with associated citation information and proposes a new citation prediction model called SChuBERT, which outperforms previous methods by a large margin.
Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction
TLDR
This work proposes the use of HANs combined with structure-tags that mark the role of sentences in the document, which yields improvements over the state-of-the-art for scholarly document quality prediction.
Analysing the Requirements for an Open Research Knowledge Graph: Use Cases, Quality Requirements and Construction Strategies
TLDR
This paper presents a comprehensive analysis of requirements for an Open Research Knowledge Graph (ORKG), collecting and reviewing daily core tasks of a scientist and establishing their consequential requirements for a KG-based system, and identifying overlaps and specificities.

References

SHOWING 1-10 OF 58 REFERENCES
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
A Neural Probabilistic Model for Context Based Citation Recommendation
TLDR
A novel neural probabilistic model that jointly learns the semantic representations of citation contexts and cited papers is proposed; it significantly outperforms other state-of-the-art models in recall, MAP, MRR, and nDCG.
The ACL anthology network corpus
We introduce the ACL Anthology Network (AAN), a comprehensive manually curated networked database of citations, collaborations, and summaries in the field of Computational Linguistics. We also…
The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics
TLDR
This is a post-print of a paper from Sixth International Conference on Language Resources and Evaluation 2008, where six papers were presented, one of which was new to the literature. Expand
CiteSeer: an automatic citation indexing system
TLDR
CiteSeer has many advantages over traditional citation indexes, including the ability to create more up-to-date databases which are not limited to a preselected set of journals or restricted by journal publication delays, completely autonomous operation with a corresponding reduction in cost, and powerful interactive browsing of the literature using the context of citations.
A Web-scale system for scientific knowledge exploration
TLDR
This work presents a large-scale system to identify hundreds of thousands of scientific concepts, tag these identified concepts to hundreds of millions of scientific publications, and build a six-level concept hierarchy with a subsumption-based model.
Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers
TLDR
This study applies, evaluates and compares ten reference parsing tools in a specific business use case, and confirms that tuning the models to the task-specific data increases quality.
Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers
  • In Proceedings of the 18th ACM/IEEE
  • 2018
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
TLDR
GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.