# Scientific Statement Classification over arXiv.org

@inproceedings{Ginev2020ScientificSC, title={Scientific Statement Classification over arXiv.org}, author={Deyan Ginev and Bruce R. Miller}, booktitle={LREC}, year={2020} }

We introduce a new classification task for scientific statements and release a large-scale dataset for supervised learning. [...] We demonstrate that the task setup aligns with known success rates from the state of the art, peaking at a 0.91 F1-score via a BiLSTM encoder-decoder model. Additionally, we introduce a lexeme serialization for mathematical formulas, and observe that context-aware models could improve when also trained on the symbolic modality. Finally, we discuss the limitations of both…

## 4 Citations

ArGoT: A Glossary of Terms extracted from the arXiv

- Computer Science, Electronic Proceedings in Theoretical Computer Science
- 2021

This work introduces ArGoT, a data set of mathematical terms extracted from the articles hosted on the arXiv website, and demonstrates how this structure is reflected in the text's vector representations and how these representations capture relations of entailment between mathematical concepts.

Towards Explaining STEM Document Classification using Mathematical Entity Linking

- Computer Science, ArXiv
- 2021

This work presents first advances towards explainable STEM document classification using classical and mathematical Entity Linking, and indicates that mathematical entities have the potential to provide high explainability, as they are a crucial part of a STEM document.

A Study into Math Document Classification using Deep Learning

- Computer Science, Computer Science & Information Technology (CS & IT)
- 2020

This paper examines the optimization of a deep learning (DL) model, an LSTM combined with a one-dimensional CNN, for math document classification, and investigates the model across several input representations, key design parameters, and decision choices, including the choice of the best input representation.

A Contextual and Labeled Math-Dataset Derived from NIST's DLMF

- Computer Science, CICM
- 2020

This paper presents a new dataset derived from NIST's widely used Digital Library of Mathematical Functions, motivated by the fact that many ML-based NLP algorithms are sentence-oriented.

## References

Showing 1-10 of 40 references.

On the Use of ArXiv as a Dataset

- Computer Science, ArXiv
- 2019

This work provides a pipeline that standardizes and simplifies access to the arXiv's publicly available data, and uses this pipeline to extract and analyze a 6.7-million-edge citation graph, together with an 11-billion-word corpus of full-text research articles.

Understanding the Logical and Semantic Structure of Large Documents

- Computer Science, SDM 2017
- 2017

This work describes a framework that analyzes a large document and helps readers locate particular information within it. The framework aims to automatically identify and classify semantic sections of documents and to assign consistent, human-understandable labels to similar sections across documents.

Document Embedding with Paragraph Vectors

- Computer Science, ArXiv
- 2015

This work observes that the Paragraph Vector method performs significantly better than other methods, proposes a simple improvement to enhance embedding quality, and shows that, much like word embeddings, vector operations on Paragraph Vectors can produce useful semantic results.

Revisiting LSTM Networks for Semi-Supervised Text Classification via Mixed Objective Function

- Computer Science, AAAI
- 2019

This paper develops a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results compared with more complex approaches, and shows the generality of the mixed objective function by improving performance on a relation extraction task.

NTCIR-12 MathIR Task Overview

- Computer Science, NTCIR
- 2016

This overview paper summarizes the task design, corpora, submitted runs, results, and the approaches used by participating groups of the NTCIR-12 MathIR Task.

GloVe: Global Vectors for Word Representation

- Computer Science, EMNLP
- 2014

This work proposes a new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

Hierarchical Attention Networks for Document Classification

- Computer Science, NAACL
- 2016

Experiments conducted on six large-scale text classification tasks demonstrate that the proposed architecture outperforms previous methods by a substantial margin.

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

- Computer Science, EMNLP
- 2014

Qualitatively, the proposed RNN Encoder‐Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.

Adam: A Method for Stochastic Optimization

- Computer Science, ICLR
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Transforming Large Collections of Scientific Publications to XML

- Computer Science, Math. Comput. Sci.
- 2010

The first task of the arXMLiv project is to develop LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collection, as well as methods for coping with the eccentricities that TeX encourages.