• Corpus ID: 201667815

Scientific Statement Classification over arXiv.org

Deyan Ginev and Bruce R. Miller
We introduce a new classification task for scientific statements and release a large-scale dataset for supervised learning. We demonstrate that the task setup aligns with known success rates from the state of the art, peaking at a 0.91 F1-score via a BiLSTM encoder-decoder model. Additionally, we introduce a lexeme serialization for mathematical formulas, and observe that context-aware models could improve when also trained on the symbolic modality. Finally, we discuss the limitations of both…

ArGoT: A Glossary of Terms extracted from the arXiv
  • Luis Berlioz
  • Computer Science
    Electronic Proceedings in Theoretical Computer Science
  • 2021
This work introduces ArGoT, a data set of mathematical terms extracted from the articles hosted on the arXiv website, and demonstrates how its structure is reflected in the text's vector representations and how these capture relations of entailment between mathematical concepts.
Towards Explaining STEM Document Classification using Mathematical Entity Linking
First advances towards STEM document classification explainability using classical and mathematical Entity Linking are presented, indicating that mathematical entities have the potential to provide high explainability, as they are a crucial part of a STEM document.
A Study into Math Document Classification using Deep Learning
This paper examines the optimization of a deep learning (DL) model, an LSTM combined with a one-dimensional CNN, for math document classification, and investigates the model with several input representations, key design parameters, and decision choices, including the choice of the best input representation.
A Contextual and Labeled Math-Dataset Derived from NIST's DLMF
This paper presents a new dataset derived from NIST's widely used Digital Library of Mathematical Functions, motivated by the fact that many ML-based NLP algorithms are sentence-oriented.


On the Use of ArXiv as a Dataset
This work provides a pipeline which standardizes and simplifies access to the arXiv's publicly available data, and uses this pipeline to extract and analyze a 6.7 million edge citation graph, with an 11 billion word corpus of full-text research articles.
Understanding the Logical and Semantic Structure of Large Documents
A framework is described that analyzes a large document and helps people locate particular information within it; it aims to automatically identify and classify semantic sections of documents and assign consistent, human-understandable labels to similar sections across documents.
Document Embedding with Paragraph Vectors
This work observes that the Paragraph Vector method performs significantly better than other methods, proposes a simple improvement to enhance embedding quality, and shows that, much like word embeddings, vector operations on Paragraph Vectors can yield useful semantic results.
Revisiting LSTM Networks for Semi-Supervised Text Classification via Mixed Objective Function
This paper develops a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results compared with more complex approaches, and shows the generality of the mixed objective function by improving the performance on relation extraction task.
NTCIR-12 MathIR Task Overview
This overview paper summarizes the task design, corpora, submitted runs, results, and the approaches used by participating groups of the NTCIR-12 MathIR Task.
GloVe: Global Vectors for Word Representation
A new global log-bilinear regression model is presented that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Hierarchical Attention Networks for Document Classification
Experiments conducted on six large-scale text classification tasks demonstrate that the proposed architecture outperforms previous methods by a substantial margin.
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
Qualitatively, the proposed RNN Encoder‐Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Transforming Large Collections of Scientific Publications to XML
The first task of the arXMLiv project is to develop LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arχiv collection, as well as methods for coping with the eccentricities that TeX encourages.