SPECTER: Document-level Representation Learning using Citation-informed Transformers
@article{Cohan2020SPECTERDR, title={SPECTER: Document-level Representation Learning using Citation-informed Transformers}, author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld}, journal={ArXiv}, year={2020}, volume={abs/2004.07180} }
Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate embeddings of documents are…
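The training signal the paper builds on is a citation-informed triplet objective: a query paper should embed closer to a paper it cites than to a paper it does not cite. Below is a minimal PyTorch sketch of that objective; the Euclidean distance and margin of 1 follow the paper's described setup, while the toy batch and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def citation_triplet_loss(query_emb, pos_emb, neg_emb, margin=1.0):
    """Triplet margin loss over document embeddings.

    query: embedding of a query paper; pos: a paper it cites;
    neg: a paper it does not cite. Distances are Euclidean (L2)
    and margin=1.0, matching the setup described in the paper.
    """
    d_pos = torch.norm(query_emb - pos_emb, p=2, dim=-1)
    d_neg = torch.norm(query_emb - neg_emb, p=2, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random vectors standing in for encoder output
# (in the paper, embeddings come from a SciBERT-style encoder
# over "title [SEP] abstract").
q, p, n = (torch.randn(8, 768) for _ in range(3))
loss = citation_triplet_loss(q, p, n)
```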
100 Citations
Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations
- Computer Science · ArXiv
- 2022
The BioCreative LitCovid track was organized as a community effort to tackle automated topic annotation for COVID-19 literature; an ensemble learning-based method that utilizes multiple biomedical pre-trained models is proposed.
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings
- Computer Science · ArXiv
- 2022
The resulting method SciNCL outperforms the state of the art on the SciDocs benchmark, trains sample-efficiently, and is demonstrated to combine well with recent training-efficient methods while outperforming baselines pretrained in-domain.
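SciNCL's contrastive signal comes from nearest neighbors in a citation-graph embedding space rather than from raw citation links. A hypothetical sketch of that sampling idea, assuming precomputed node embeddings; the neighbor-band boundaries below are illustrative placeholders, not the paper's tuned values.

```python
import numpy as np

def sample_contrastive_pair(node_embs, query_idx, k_pos=5,
                            k_neg_lo=20, k_neg_hi=25):
    """Pick a positive from the query's nearest neighbors in a
    citation-graph embedding space, and a 'hard' negative from a
    band of slightly more distant neighbors. Band sizes here are
    illustrative, not the paper's settings."""
    dists = np.linalg.norm(node_embs - node_embs[query_idx], axis=1)
    order = np.argsort(dists)          # order[0] is the query itself
    pos = np.random.choice(order[1:1 + k_pos])
    neg = np.random.choice(order[k_neg_lo:k_neg_hi])
    return pos, neg
```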
CSFCube - A Test Collection of Computer Science Research Articles for Faceted Query by Example
- Computer Science · NeurIPS Datasets and Benchmarks
- 2021
This work introduces the task of faceted Query by Example, in which users can also specify a finer-grained aspect in addition to the input query document, and describes an expert-annotated test collection to evaluate models trained to perform this task.
Improving BERT-based Query-by-Document Retrieval with Multi-Task Optimization
- Computer Science · ECIR
- 2022
This work improves the retrieval effectiveness of the BERT re-ranker, proposing an extension to its fine-tuning step to better exploit the context of queries: an additional document-level representation learning objective is used besides the ranking objective when fine-tuning the BERT re-ranker.
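That multi-task setup can be read as optimizing a weighted sum of a ranking loss and a document-representation loss during fine-tuning. A minimal sketch; the interpolation weight and both concrete loss choices are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(rel_logits, rel_labels, doc_emb, pos_emb, neg_emb,
                   alpha=0.5):
    """Weighted sum of (i) a pointwise ranking loss over the
    re-ranker's relevance logits and (ii) a triplet-style
    document-representation loss. alpha and both loss functions
    are illustrative assumptions."""
    rank_loss = F.binary_cross_entropy_with_logits(rel_logits, rel_labels)
    repr_loss = F.triplet_margin_loss(doc_emb, pos_emb, neg_emb, margin=1.0)
    return alpha * rank_loss + (1.0 - alpha) * repr_loss
```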
Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review
- Computer Science · ArXiv
- 2022
This work proposes the first intertextual model of text-based collaboration, which encompasses three major phenomena that make up a full iteration of the review-revise-and-resubmit cycle: pragmatic tagging, linking and long-document version alignment.
Specialized Document Embeddings for Aspect-based Similarity of Research Papers
- Computer Science · ArXiv
- 2022
The approach of aspect-based document embeddings mitigates potential risks arising from implicit biases by making them explicit and can, for example, be used for more diverse and explainable recommendations.
Structure and Semantics Preserving Document Representations
- Computer Science
- 2022
This work proposes here a holistic approach to learning document representations by integrating intra-document content with inter-document relations and demonstrates that this model outperforms competing methods on multiple datasets for document retrieval tasks.
Structure with Semantics: Exploiting Document Relations for Retrieval
- Computer Science · ArXiv
- 2022
This deep metric learning solution analyzes the complex neighborhood structure in the relationship network to efficiently sample similar/dissimilar document pairs and defines a novel quintuplet loss function that simultaneously encourages document pairs that are semantically relevant to be closer and structurally unrelated to be far apart in the representation space.
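The paper's exact quintuplet formulation is not spelled out here; in the spirit of the summary, a hypothetical margin-based version would pair one term pulling semantically relevant documents together with another pushing structurally unrelated ones apart. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def quintuplet_loss(anchor, sem_pos, sem_neg, struct_pos, struct_neg,
                    margin=1.0):
    """Hypothetical quintuplet-style loss: one margin term for
    semantic relevance, one for structural relatedness. The
    paper's actual formulation may differ."""
    d = lambda a, b: torch.norm(a - b, p=2, dim=-1)
    sem_term = F.relu(d(anchor, sem_pos) - d(anchor, sem_neg) + margin)
    struct_term = F.relu(d(anchor, struct_pos) - d(anchor, struct_neg) + margin)
    return (sem_term + struct_term).mean()
```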
Textomics: A Dataset for Genomics Data Summary Generation
- Computer Science · ACL
- 2022
Summarizing biomedical discovery from genomics data using natural languages is an essential step in biomedical research but is mostly done manually. Here, we introduce Textomics, a novel dataset of…
The Inefficiency of Language Models in Scholarly Retrieval: An Experimental Walk-through
- Computer Science · Findings of ACL
- 2022
Retrieval performance turns out to be influenced more by the surface form than by the semantics of the text, and an exhaustive categorization yields several classes of orthographically and semantically related, partially related, and completely unrelated neighbors.
References
Showing 1–10 of 59 references
A Comprehensive Survey on Graph Neural Networks
- Computer Science · IEEE Transactions on Neural Networks and Learning Systems
- 2019
This article provides a comprehensive overview of graph neural networks (GNNs) in the data mining and machine learning fields and proposes a new taxonomy that divides the state-of-the-art GNNs into four categories: recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial-temporal GNNs.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Computer Science · NAACL
- 2019
A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
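The "one additional output layer" claim maps directly onto the standard sequence-classification head. A minimal sketch assuming the Hugging Face `transformers` library; the checkpoint name and label count here are illustrative choices, not part of the paper.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Fine-tuning adds a single classification layer on top of the
# pooled [CLS] representation; everything else is the pretrained stack.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # num_labels is task-specific
)

inputs = tokenizer("Representation learning matters.", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, num_labels)
```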
A Simple but Tough-to-Beat Baseline for Sentence Embeddings
- Computer Science · ICLR
- 2017
An Overview of Microsoft Academic Service (MAS) and Applications
- Computer Science · WWW
- 2015
A knowledge-driven, highly interactive dialog that seamlessly combines a reactive search and proactive suggestion experience, along with a proactive heterogeneous entity recommendation, is demonstrated.
Improving Textual Network Embedding with Global Attention via Optimal Transport
- Computer Science · ACL
- 2019
This work reformulates the network embedding problem and presents two novel strategies to improve over traditional attention mechanisms: (i) a content-aware sparse attention module based on optimal transport, and (ii) a high-level attention parsing module.
Improving Textual Network Learning with Variational Homophilic Embeddings
- Computer Science · NeurIPS
- 2019
Variational Homophilic Embedding is introduced, a fully generative model that learns network embeddings by modeling the semantic (textual) information with a variational autoencoder, while accounting for the structural information through a novel homophilic prior design.
Simplifying Graph Convolutional Networks
- Computer Science · ICML
- 2019
This paper successively removes the nonlinearities and collapses the weight matrices between consecutive GCN layers, then theoretically analyzes the resulting linear model and shows that it corresponds to a fixed low-pass filter followed by a linear classifier.
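That linearization reduces graph convolution to a feature preprocessing step: propagate features K times with the normalized adjacency, then fit any linear classifier. A minimal NumPy sketch of the two-step pipeline; dense matrices are used here for clarity.

```python
import numpy as np

def sgc_features(adj, features, k=2):
    """Collapse a K-layer GCN into S^K X, where S is the degree-
    normalized adjacency with self-loops (the fixed low-pass
    filter). A linear/softmax classifier on the result gives the
    simplified model."""
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    d_inv_sqrt = np.diag(a_hat.sum(1) ** -0.5)   # D^{-1/2}
    s = d_inv_sqrt @ a_hat @ d_inv_sqrt          # S = D^{-1/2} A_hat D^{-1/2}
    for _ in range(k):
        features = s @ features                  # K propagation steps
    return features  # train e.g. logistic regression on these
```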
Attention is All you Need
- Computer Science · NIPS
- 2017
A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by successful application to English constituency parsing with both large and limited training data.
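The core operation of that architecture is scaled dot-product attention, softmax(QK^T / sqrt(d_k))·V. A minimal PyTorch sketch, single head with no masking or dropout for clarity:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V -- the Transformer's core op.
    Single head, no masking or dropout."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 4, 64)  # (batch, seq_len, d_k)
out = scaled_dot_product_attention(q, k, v)
```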
Inductive Representation Learning on Large Graphs
- Computer Science · NIPS
- 2017
GraphSAGE is presented, a general, inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data and outperforms strong baselines on three inductive node-classification benchmarks.
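GraphSAGE's inductive step is: aggregate sampled neighbor features, concatenate with the node's own features, and apply a learned transform. A minimal sketch of the mean-aggregator variant; dimensions and sample counts are illustrative.

```python
import torch
import torch.nn as nn

class SageMeanLayer(nn.Module):
    """One GraphSAGE layer with the mean aggregator:
    h_v = ReLU(W [h_v ; mean(h_u for u in sampled N(v))])."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h_self, h_neigh):
        # h_self: (n, in_dim); h_neigh: (n, num_samples, in_dim)
        agg = h_neigh.mean(dim=1)  # mean over sampled neighbors
        return torch.relu(self.lin(torch.cat([h_self, agg], dim=-1)))

layer = SageMeanLayer(16, 8)
out = layer(torch.randn(5, 16), torch.randn(5, 10, 16))
```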
SciBERT: A Pretrained Language Model for Scientific Text
- Computer Science · EMNLP
- 2019
SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks and demonstrates statistically significant improvements over BERT.
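SciBERT ships as a drop-in BERT checkpoint, and SPECTER initializes its encoder from it. A minimal sketch of embedding a paper with the `allenai/scibert_scivocab_uncased` checkpoint from the Hugging Face hub; taking the [CLS] state over "title [SEP] abstract" mirrors SPECTER's input format, and the example text is illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

# Encode "title [SEP] abstract" and take the [CLS] state as the
# document embedding.
text = ("SPECTER: Document-level Representation Learning"
        " [SEP] Representation learning is a critical ingredient...")
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    cls_embedding = model(**inputs).last_hidden_state[:, 0, :]
```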