SPECTER: Document-level Representation Learning using Citation-informed Transformers

@article{Cohan2020SPECTERDR,
  title={SPECTER: Document-level Representation Learning using Citation-informed Transformers},
  author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.07180}
}
Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate embeddings of documents are… 
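
As a rough illustration of how such document-level embeddings are typically produced, here is a minimal sketch that encodes a paper's title and abstract with the released SPECTER checkpoint (assumed to be the public allenai/specter model on the Hugging Face Hub) and takes the [CLS] vector as the document representation; the example papers are placeholders.

# Minimal sketch: SPECTER-style document embeddings via Hugging Face transformers.
# The checkpoint name and example papers are assumptions, not part of this page.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

papers = [
    {"title": "BERT", "abstract": "We introduce a new language representation model ..."},
    {"title": "Attention is All you Need", "abstract": "The dominant sequence transduction models ..."},
]

# Title and abstract are concatenated with the tokenizer's separator token.
texts = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token embedding serves as the document representation.
embeddings = outputs.last_hidden_state[:, 0, :]
print(embeddings.shape)  # (num_papers, hidden_size)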

Citations

Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations
TLDR
The BioCreative LitCovid track was organized to call for a community effort to tackle automated topic annotation for COVID-19 literature, and an ensemble learning-based method that utilizes multiple biomedical pre-trained models is proposed.
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings
TLDR
The resulting method SciNCL outperforms the state of the art on the SciDocs benchmark and can be trained sample-efficiently, and it is demonstrated that it can be combined with recent training-efficient methods and outperforms baselines pretrained in-domain.
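As a loose illustration of the neighborhood-contrastive idea summarized above, the sketch below applies a triplet margin loss where the positive is assumed to come from a paper's citation-embedding neighborhood and the hard negative from just outside it; the sampling strategy, margin, and dimensions are illustrative, not the paper's exact settings.

# Illustrative triplet margin loss; positive/negative sampling from the citation
# embedding space is assumed and stubbed out with random tensors here.
import torch
import torch.nn.functional as F

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    # Pull the anchor toward its citation-neighborhood positive, push it from the negative.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

anchor = torch.randn(4, 768)    # document embeddings of the anchor papers
positive = torch.randn(4, 768)  # e.g. sampled from each anchor's citation-embedding neighbors
negative = torch.randn(4, 768)  # e.g. sampled just outside that neighborhood
print(triplet_margin_loss(anchor, positive, negative))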
CSFCube - A Test Collection of Computer Science Research Articles for Faceted Query by Example
TLDR
This work introduces the task of faceted Query by Example, in which users can also specify a finer-grained aspect in addition to the input query document, and describes an expert-annotated test collection to evaluate models trained to perform this task.
Improving BERT-based Query-by-Document Retrieval with Multi-Task Optimization
TLDR
This work improves the retrieval effectiveness of the BERT re-ranker by proposing an extension to its fine-tuning step that better exploits the context of queries, using an additional document-level representation learning objective alongside the ranking objective when fine-tuning the BERT re-ranker.
Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review
TLDR
This work proposes the first intertextual model of text-based collaboration, which encompasses three major phenomena that make up a full iteration of the review-revise-and-resubmit cycle: pragmatic tagging, linking and long-document version alignment.
Specialized Document Embeddings for Aspect-based Similarity of Research Papers
TLDR
The approach of aspect-based document embeddings mitigates potential risks arising from implicit biases by making them explicit and can, for example, be used for more diverse and explainable recommendations.
Structure and Semantics Preserving Document Representations
TLDR
This work proposes here a holistic approach to learning document representations by integrating intra-document content with inter-document relations and demonstrates that this model outperforms competing methods on multiple datasets for document retrieval tasks.
Structure with Semantics: Exploiting Document Relations for Retrieval
TLDR
This deep metric learning solution analyzes the complex neighborhood structure in the relationship network to efficiently sample similar/dissimilar document pairs, and defines a novel quintuplet loss function that simultaneously encourages semantically relevant document pairs to be closer and structurally unrelated pairs to be farther apart in the representation space.
Textomics: A Dataset for Genomics Data Summary Generation
Summarizing biomedical discovery from genomics data using natural languages is an essential step in biomedical research but is mostly done manually. Here, we introduce Textomics, a novel dataset of…
The Inefficiency of Language Models in Scholarly Retrieval: An Experimental Walk-through
TLDR
Retrieval performance turns out to be influenced more by the surface form than by the semantics of the text, and an exhaustive categorization yields several classes of orthographically and semantically related, partially related, and completely unrelated neighbors.

References

SHOWING 1-10 OF 59 REFERENCES
A Comprehensive Survey on Graph Neural Networks
TLDR
This article provides a comprehensive overview of graph neural networks (GNNs) in the data mining and machine learning fields and proposes a new taxonomy that divides state-of-the-art GNNs into four categories, namely recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial–temporal GNNs.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
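A minimal sketch of the "one additional output layer" fine-tuning setup mentioned in the summary above, using the Hugging Face transformers classification head; the checkpoint name, inputs, and labels are placeholders.

# Sketch: sequence classification by adding a single output layer on top of BERT.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["an example sentence", "another example"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0, 1])

# The classification head over the pooled [CLS] representation is trained jointly
# with the pretrained bidirectional encoder.
outputs = model(**batch, labels=labels)
outputs.loss.backward()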
An Overview of Microsoft Academic Service (MAS) and Applications
TLDR
A knowledge driven, highly interactive dialog that seamlessly combines reactive search and proactive suggestion experience, and a proactive heterogeneous entity recommendation are demonstrated.
Improving Textual Network Embedding with Global Attention via Optimal Transport
TLDR
This work reformulates the network embedding problem and presents two novel strategies to improve over traditional attention mechanisms: (i) a content-aware sparse attention module based on optimal transport; and (ii) a high-level attention parsing module.
Improving Textual Network Learning with Variational Homophilic Embeddings
TLDR
Variational Homophilic Embedding is introduced, a fully generative model that learns network embeddings by modeling the semantic (textual) information with a variational autoencoder, while accounting for the structural information through a novel homophilic prior design.
Simplifying Graph Convolutional Networks
TLDR
This paper successively removes nonlinearities and collapses weight matrices between consecutive layers, theoretically analyzes the resulting linear model, and shows that it corresponds to a fixed low-pass filter followed by a linear classifier.
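A small sketch of that linear model on a toy graph: K applications of the symmetrically normalized adjacency matrix (the fixed low-pass filter) with no nonlinearities in between, after which an ordinary linear classifier is fit; the graph and feature dimensions are illustrative.

# Sketch of simplified graph convolution: propagate features K times with the
# normalized adjacency, then hand the result to a linear classifier.
import numpy as np

def sgc_features(adj, features, k=2):
    # Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    s = d_inv_sqrt @ a_hat @ d_inv_sqrt
    for _ in range(k):          # no nonlinearities between propagation steps
        features = s @ features
    return features             # fed to e.g. multinomial logistic regression

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # toy 3-node graph
x = np.random.randn(3, 4)
print(sgc_features(adj, x, k=2).shape)  # (3, 4)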
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
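For reference, a minimal sketch of the scaled dot-product attention at the core of the Transformer, softmax(QK^T / sqrt(d_k)) V; batch size and dimensions are illustrative.

# Sketch: scaled dot-product attention, the basic building block of the Transformer.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # query-key similarities
    weights = F.softmax(scores, dim=-1)            # attention distribution per query
    return weights @ v                             # weighted sum of the values

q = torch.randn(2, 5, 64)  # (batch, query positions, d_k)
k = torch.randn(2, 7, 64)
v = torch.randn(2, 7, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 64])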
Inductive Representation Learning on Large Graphs
TLDR
GraphSAGE is presented, a general, inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data and outperforms strong baselines on three inductive node-classification benchmarks.
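A rough sketch of a single GraphSAGE-style layer with a mean aggregator: each node combines its own features with the mean of its neighbors' features through a learned linear map; neighbor sampling and layer stacking are omitted, and all names and dimensions are illustrative.

# Sketch of one GraphSAGE-style layer (mean aggregator), without neighbor sampling.
import torch
import torch.nn as nn

class SageMeanLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, x, neighbors):
        # x: (num_nodes, in_dim); neighbors: one index tensor per node
        agg = torch.stack([x[idx].mean(dim=0) for idx in neighbors])
        return torch.relu(self.linear(torch.cat([x, agg], dim=-1)))

x = torch.randn(3, 8)                                            # node features of a toy graph
neighbors = [torch.tensor([1]), torch.tensor([0, 2]), torch.tensor([1])]
print(SageMeanLayer(8, 16)(x, neighbors).shape)                  # torch.Size([3, 16])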
SciBERT: A Pretrained Language Model for Scientific Text
TLDR
SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks and demonstrates statistically significant improvements over BERT.