S2ORC: The Semantic Scholar Open Research Corpus
- Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Michael Kinney, Daniel S. Weld
- Computer Science, Annual Meeting of the Association for…
- 2020
This paper introduces S2ORC, a large corpus of 81.1M English-language academic papers spanning many academic disciplines, which is expected to facilitate research and development of tools and tasks for text mining over academic text.
GORC: A large contextual citation graph of academic papers
- Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Michael Kinney, Daniel S. Weld
- Computer Science, ArXiv
- 7 November 2019
We introduce the Semantic Scholar Graph of References in Context (GORC), a large contextual citation graph of 81.1M academic publications, including parsed full text for 8.1M open access papers,…
PySBD: Pragmatic Sentence Boundary Disambiguation
- Nipun Sadvilkar, Mark Neumann
- Computer Science, NLPOSS
- 19 October 2020
This work ports the Golden Rules Set (a language-specific set of sentence boundary exemplars), originally implemented as the Ruby gem pragmatic segmenter, to Python, with additional improvements and functionality.
PAWLS: PDF Annotation With Labels and Structure
- Mark Neumann, Zejiang Shen, Sam Skjonsberg
- Computer Science, Annual Meeting of the Association for…
- 25 January 2021
This paper presents PDF Annotation with Labels and Structure (PAWLS), a new annotation tool designed specifically for the PDF document format, particularly suited for mixed-mode annotation and scenarios in which annotators require extended context to annotate accurately.