Our Work
Semantic Scholar Publications
We are an interdisciplinary research team focused on AI, HCI, ML, NLP, accessibility and computational social science in support of Semantic Scholar's mission of accelerating science. Our team is part of the Allen Institute for AI, a nonprofit research institute advancing AI for the common good.
Follow us on Twitter for research updates!
This work proposes caching an intermediate layer's output from a pretrained model and finetuning the remaining layers for new tasks, and shows that this method provides a 100% speedup during training and a 55-86% speedup for inference, with negligible impact on accuracy for text classification and entity recognition tasks in the scientific domain.
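A minimal PyTorch sketch of this layer-caching idea, under our own assumptions (a BERT-style encoder with the bottom K layers frozen; the helper names and the choice of K are illustrative, not the paper's code): the frozen layers run once to cache hidden states, and only the top layers are finetuned on the cached activations.

```python
import torch
from transformers import AutoModel, AutoTokenizer

K = 6  # how many bottom layers to freeze and cache (illustrative)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def cache_intermediate(texts):
    """Run the frozen bottom K layers once and store layer K's hidden states."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[K], inputs["attention_mask"]

def top_layers_forward(hidden, attention_mask):
    """Finetune only the remaining layers, reading from the cached activations."""
    ext_mask = model.get_extended_attention_mask(attention_mask, hidden.shape[:2])
    for layer in model.encoder.layer[K:]:
        hidden = layer(hidden, attention_mask=ext_mask)[0]
    return hidden

cached, mask = cache_intermediate(["An example sentence to cache."])
features = top_layers_forward(cached, mask)  # gradients touch only the top layers
```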
CiteSee is a paper reading tool that leverages a user's publishing, reading, and saving activities to provide personalized visual augmentations and context around citations to help users prioritize their exploration.
In order to help scholars understand and follow a research topic, significant research has been devoted to creating systems that help scholars...
This work designs a system, Relatedly, that scaffolds exploring and reading multiple related work paragraphs on a topic, with features including dynamic re-ranking and highlighting to spotlight unexplored dissimilar information, auto-generated descriptive paragraph headings, and low-lighting of redundant information.
This paper combines public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to date.
The key intellectual question is whether it is possible, if at all, to design a learning algorithm that does not benefit from scale, yet leads to a competitive level of commonsense acquisition.
This work considers design choices for the annotation interface used to elicit human judgments and their impact on reproducibility, and develops an automated mechanism for maintaining annotator quality via a probabilistic model that detects and excludes noisy annotators.
This paper proposes to train a GenQA model by transferring knowledge from a trained AS2 model, and to use the AS2 model's prediction scores for loss weighting and score-conditioned input/output shaping to aid the knowledge transfer.
This paper proposes three novel sentence-level transformer pre-training objectives that incorporate paragraph-level semantics within and across documents, to improve the performance of transformers for AS2, and mitigate the requirement of large labeled datasets.
This paper proposes a Multiple Heads Student architecture (named CERBERUS), an efficient neural network designed to distill an ensemble of large transformers into a single smaller model, rivaling the state-of-the-art large AS2 models that have 2.7x more parameters and run 2x slower.
It is shown how state-of-the-art models struggle to generalize across task formats, and that simple multi-task training fails to improve them, and a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance.
This paper introduces GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian questions, and presents the first cross-lingual answer sentence generation system (Cross-Lingual GenQA).
BLOOM is a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers and achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning.
Readers of all abilities may organically leave traces in papers, and these traces can be used to facilitate navigation tasks, in particular for low-vision readers.
This work develops a tool, integrated into users' reading process, that helps them leverage authors' existing summarizations of threads, typically found in introduction or related work sections, in order to situate their own work's contributions.
This work introduces a new technique, polymorphic lenses, that improves exploratory search over a KG by obtaining new leverage from the existing preference models that KG-based systems maintain for recommending content.
SciFact-Open is presented, a new test collection designed to evaluate the performance of scientific claim verification systems on a corpus of 500K research abstracts, and it is found that systems developed on smaller corpora struggle to generalize to SciFact-Open, exhibiting performance drops of at least 15 F1.
This work presents MultiVerS, which predicts a fact-checking label and identifies rationales in a multitask fashion based on a shared encoding of the claim and full document context; this design allows the model to perform weakly supervised domain adaptation by training on scientific documents labeled using high-precision heuristics.
This work presents FEB, a standardized collection of four existing English-language datasets and associated metrics, identifies the right prompting approach by extensively exploring natural language prompts on it, and demonstrates that making progress on few-shot self-rationalization is possible.
This work proposes a novel method for equipping long-context QA models with a sequence-level objective for better identification of the supporting evidence, via an additional contrastive supervision signal during finetuning.
A novel system that automatically retrieves patient-specific literature based on intensive care unit (ICU) patient information, aggregates relevant papers, and fuses them with internal admission notes to form outcome predictions; it substantially boosts predictive accuracy on three challenging tasks in comparison to strong recent baselines.
This paper shows that popular pre-trained transformers perform poorly when used for fine-tuning on multi-candidate inference tasks, and proposes a new pre-training objective that models the paragraph-level semantics across multiple input sentences.
Multi-LexSum, a collection of 9,280 expert-authored summaries drawn from ongoing Civil Rights Litigation Clearinghouse (CRLC) writing, is introduced, demonstrating that despite the high-quality summaries in the training data, state-of-the-art summarization models perform poorly on this task.
The framework presented is a multi-party international governance structure focused on language data, incorporating the technical and organizational tools needed to support its work.
We present Aspire, a new scientific document similarity model based on matching fine-grained aspects.
We introduce new methods for incorporating VIsual LAyout (VILA) structures, e.g., the grouping of page texts into text lines or text blocks, into language models to further improve performance on automated scientific document understanding.
Grounding model predictions in clinically-relevant symptoms can improve generalizability while producing a model that is easier to inspect, and this approach can still perform competitively on in-domain data.
This tutorial aims at bringing interested NLP researchers up to speed about the recent and ongoing techniques for zero- and few-shot learning with pretrained language models.
A novel framework to generate pragmatically relevant true and false instances of a generic, which outperforms few-shot generation from GPT-3 and highlights the importance of constrained decoding for this task and the implications of generics exemplars for language inference tasks.
This work proposes scientific claim generation, the task of generating one or more atomic and verifiable claims from scientific sentences, and demonstrates its usefulness in zero-shot fact checking for biomedical claims; it introduces CLAIMGEN-BART, a new supervised method for generating claims supported by the literature, as well as KBIN, a novel method for generating claim negations.
ACCoRD, an end-to-end system tackling the novel task of generating sets of descriptions of scientific concepts, is presented, and a user study demonstrates that users prefer descriptions produced by the system and prefer multiple descriptions to a single "best" description.
Scim is presented, an AI-augmented reading interface designed to help researchers skim papers by automatically identifying, classifying, and highlighting salient sentences, organized into rhetorical facets rooted in common information needs.
PRIMERA is introduced, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data.
This work proposes a novel computational representation that automatically breaks up products into fine-grained functional facets, and designs similarity metrics that support granular matching between functional facets across ideas, and uses them to build a novel functional search capability that enables expressive queries for mechanisms and purposes.
This work introduces multiple new methods for augmenting recommendations with textual relevance messages that highlight knowledge-graph connections between recommended papers and a user's publication and interaction history, and develops a novel method that highlights connections with proxy authors of interest to users.
This work contributes two datasets to the study of mentorship, one of which has over 300,000 ground truth academic mentor-mentee pairs obtained from multiple diverse, manually curated sources, and linked to the Semantic Scholar (S2) knowledge graph.
We construct a faceted representation of authors with information gleaned from their papers and inferred author personas, and use it to develop an approach that locates commonalities ("bridges") and contrasts between scientists. This approach helps users discover authors considered useful for generating novel research directions.
A National Science Foundation Convergence Accelerator project is described to build a set of Knowledge Network Programming Infrastructure systems that address the frustratingly slow process of building, using, and scaling large knowledge networks.
A novel paper reading experience that integrates relevant information about follow-on work directly into a paper, allowing readers to learn about newer papers and see how a paper is discussed by its citing papers in the context of the reference paper.
PINOCCHIO is presented, a new decoding method that improves the consistency of a transformer-based abstractive summarizer by constraining beam search to avoid hallucinations.
This paper introduces LIMEADE, the first general framework that translates both positive and negative advice into an update to an arbitrary, underlying opaque model, and shows that the framework leads to higher perceived user control, trust, and satisfaction.
To improve access to medical papers, we introduce a novel interactive interface, Paper Plain, with four features powered by natural language processing: definitions of unfamiliar terms, in-situ plain language section summaries, a collection of key questions that guide readers to answering passages, and plain language summaries of the answering passages.
Our goal is to bolster the ability of researchers and clinicians to keep track of difficulties, limitations and emerging hypotheses.
Few-shot NLP research lacks a unified, challenging-yet-realistic evaluation setup. In response, we introduce FLEX, a rigorous few-shot learning NLP benchmark and public leaderboard measuring four transfer types. We also present UniFew, a simple, competitive baseline that does not rely on heavy prompt engineering or complex meta-learning methods.
This paper proposes generating personalized scientific concept descriptions that are tailored to the user's expertise and context, outlines a complete architecture for the task, and releases an expert-annotated resource, ACCoRD.
This work releases MS^2 (Multi-Document Summarization of Medical Studies), a dataset of over 470K documents and 20K summaries derived from the scientific literature, which facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies; it is the first large-scale, publicly available multi-document summarization dataset in the biomedical domain.
A new pretrained language model for cross-document tasks.
We present SciA11y, a system that renders inaccessible scientific paper PDFs into HTML.
Integrating scientific language models and graph embeddings for boosting drug discovery.
An extension of cross-document coreference with a referential hierarchy over mention clusters, in the scientific document domain. New task, dataset and models with applications in faceted document retrieval and knowledge base construction.
We present S2AND, a unified benchmark dataset for author name disambiguation (AND) on scholarly papers, as well as an open-source reference model implementation.
PAWLS is a new annotation tool designed specifically for the PDF document format. PAWLS supports span-based textual annotation, N-ary relations and freeform, non-textual bounding boxes, all of which can be exported in convenient formats for training multi-modal machine learning models.
We address the task of citation text generation: given a pair of scientific documents, explain their relationship in natural language text in the manner of a citation from one text to the other.
We introduce ParsiNLU, the first benchmark in the Persian language that includes a range of high-level tasks (reading comprehension, textual entailment, etc.). These datasets are collected in a multitude of ways, often involving manual annotation by native speakers.
We highlight three understudied phenomena for citation context analysis and release MultiCite, a new dataset of 12.6K citation contexts from 1.2K computational linguistics papers that fully models these phenomena.
We present an overview of the SCIVER shared task. In addition to surveying the participating systems, we provide several insights into modeling approaches to support continued progress and future research on scientific claim verification.
To navigate the collection of COVID-19 papers from different domains, we present a KB of mechanisms relating to COVID-19, to support domain-agnostic search and exploration of general activities, functions, influences, and associations in these papers.
Qasper is a dataset of 5049 questions over 1585 NLP papers designed to facilitate document-grounded, information-seeking QA. Existing models that do well on other QA tasks do not perform well on these questions.
A new robust and lightweight tool for acquiring, managing, and performing typical operations over datasets used in IR, primarily focused on textual datasets used for ad-hoc search.
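This appears to describe the open-source ir_datasets library; assuming that identification, a minimal usage sketch with a real dataset ID (antique/test), iterating queries, documents, and relevance judgments:

```python
import ir_datasets

# Load a standard ad-hoc retrieval collection by its dataset ID.
dataset = ir_datasets.load("antique/test")

for query in dataset.queries_iter():      # queries as (query_id, text) records
    print(query.query_id, query.text)
    break

for doc in dataset.docs_iter():           # documents as (doc_id, text) records
    print(doc.doc_id, doc.text[:80])
    break

for qrel in dataset.qrels_iter():         # relevance judgments
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break
```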
This work conducts mixed-method user studies on three datasets, where an AI with accuracy comparable to humans helps participants solve a task (explaining itself in some conditions), and observes complementary improvements from AI augmentation that were not increased by explanations.
We introduce ScholarPhi, an augmented reading interface that brings definitions of technical terms and symbols to readers when and where they need them most.
Accessibility research has grown substantially in the past few decades, yet there has been no literature review of the field. To understand current and historical trends, we created and analyzed a dataset of accessibility papers appearing at CHI and ASSETS since ASSETS' founding in 1994.
CODE introduces neuron-level analyses and transformations aimed at identifying and removing redundant computation from the networks that compose an ensemble, enabling CODE to train large DNN ensembles in a fraction of the time and memory footprint needed by current techniques.
This paper provides a comprehensive overview of the structure and results of TREC-COVID, an information retrieval (IR) shared task to evaluate search on scientific literature related to COVID-19.
The majority of scientific papers are distributed in PDF, which poses challenges for accessibility, especially for blind and low vision (BLV) readers. We characterize the scope of this problem...
An open-source library for streamlining the usage of deep learning in document image analysis research and applications.
An analysis of 2.87 million computer science papers reveals that, if current trends continue, parity between the number of male and female authors will not be reached in this century: even under optimistic projection models, gender parity in CS is forecast only by 2100, whereas it is projected to be reached within two to three decades in the biomedical literature.
In this paper, we present a new method for generating extended summaries of long papers.
It is argued that AI systems should be trained in a human-centered manner, directly optimized for team performance, and the benefit of modeling teamwork during training is shown through improvements in expected team utility across datasets, considering parameters such as human skill and the cost of mistakes.
This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms and presents both manual and automatic metrics on the leaderboard.
This review discusses the corpora, modeling resources, systems and shared tasks that have been introduced for COVID-19, and lists 39 systems that provide functionality such as search, discovery, visualization and summarization over the COVID-19 literature.
The results suggest that while CORD-19 exhibits a strong tilt toward recent and topically focused articles, the knowledge being explored to attack the pandemic encompasses a much longer time span and is very interdisciplinary.
This work adapts the Golden Rules Set (a language-specific set of sentence boundary exemplars), originally implemented in the Ruby gem pragmatic_segmenter, porting it to Python with additional improvements and functionality.
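This describes what we take to be the pySBD package (an identification we make here, not stated above); assuming that, a minimal usage sketch:

```python
import pysbd

# Rule-based sentence boundary disambiguation, ported from the Ruby
# pragmatic_segmenter; the language is selected by ISO code.
seg = pysbd.Segmenter(language="en", clean=False)
sentences = seg.segment("Dr. Smith studied at the U.S. Naval Academy. She now works at AI2.")
print(sentences)  # two segments; 'Dr.' and 'U.S.' do not trigger false splits
```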
The task of definition detection is important for scholarly papers, because papers often make use of technical terminology that may be unfamiliar to readers. We develop a new definition detection system, HEDDEx, that utilizes syntactic features, transformer encoders, and heuristic filters, and evaluate it on a standard sentence-level benchmark.
We construct SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales. We develop baseline models for SciFact, and demonstrate that these models benefit from combined training on a large dataset of claims about Wikipedia articles, together with the new SciFact data.
We introduce TLDR generation for scientific papers, a new automatic summarization task with high source compression, and provide a new dataset and models for effective generation of TLDRs.
SciSight is a novel framework for exploratory search of COVID-19 research that integrates two key capabilities: first, exploring interactions between biomedical facets (e.g., proteins, genes, drugs, diseases, patient characteristics); and second, discovering groups of researchers and how they are connected.
We present a zero-shot ranking algorithm that adapts to COVID-related scientific literature. Our approach filters training data from another collection down to medical-related queries, uses a neural reranking model pre-trained on scientific text (SciBERT), and filters the target document collection.
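A hedged sketch of the reranking step: a cross-encoder built on the public SciBERT checkpoint scores each query-document pair. The classification head shown here is randomly initialized and would be finetuned on the medically-focused queries filtered from a general collection, as the summary describes; the model wiring and examples are our assumptions, not the paper's code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# SciBERT backbone with a single-logit relevance head (to be finetuned).
name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
reranker = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

@torch.no_grad()
def rerank(query, docs):
    """Score each (query, doc) pair jointly and sort docs by relevance."""
    inputs = tokenizer([query] * len(docs), docs, padding=True,
                       truncation=True, return_tensors="pt")
    scores = reranker(**inputs).logits.squeeze(-1)
    order = scores.argsort(descending=True)
    return [docs[i] for i in order]

ranked = rerank("coronavirus transmission routes",
                ["Aerosol and droplet spread of SARS-CoV-2 ...",
                 "An unrelated document about crop yields ..."])
```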
To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context.
A new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML), which includes new types of diagnostic tests that allow us to probe several characteristics, such as sensitivity to word order, that are not addressed by previous techniques.
This work investigates G-DAUG^C, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting, and demonstrates that it produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
Ontologies are critical to support the types of big data analysis necessary for kidney precision medicine, where heterogeneous clinical, imaging and biopsy data from diverse sources must be combined to define a patient's phenotype.
A novel, unsupervised method for extracting scientific concepts from papers, based on the intuition that each scientific concept is likely to be introduced or popularized by a single paper that is disproportionately cited by subsequent papers mentioning the concept.
The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on COVID-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers.
SUPP.AI is an attempt to close the information gap on dietary supplements by making up-to-date evidence on supplement-drug interactions (SDIs) more discoverable for researchers, clinicians, and consumers.
We show that the softmax output common in neural language models leads to a limitation: some words (in particular, those with an embedding interior to the convex hull of the embedding space) can never be assigned high probability by the model, no matter what the context.
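A small self-contained numpy demonstration of the geometry (our illustration, not the paper's code): a word's logit is the dot product of the context vector with that word's output embedding, and a convex combination of embeddings can never strictly exceed all of its vertices, so a word whose embedding lies inside the convex hull never wins the argmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy output-embedding matrix: 5 "vertex" words plus one word whose
# embedding sits strictly inside the convex hull of the others.
vertices = rng.normal(size=(5, 2))
interior = vertices.mean(axis=0)          # convex combination => interior point
E = np.vstack([vertices, interior])       # word index 5 is the interior word

# For any context vector h: h.interior = mean_i(h.vertex_i) <= max_i(h.vertex_i),
# so the interior word can never receive the highest logit (or probability).
wins = 0
for _ in range(100_000):
    h = rng.normal(size=2)
    if np.argmax(E @ h) == 5:
        wins += 1
print(wins)  # 0: the interior word is never assigned the highest probability
```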
We introduce a new dataset called SciREX that requires understanding of the whole document to annotate entities, and their document-level relationships that usually span beyond sentences or even sections.
This work proposes SPECTER, a new method to generate document-level embeddings of scientific papers based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph, and shows that SPECTER outperforms a variety of competitive baselines on the benchmark.
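A minimal sketch of producing SPECTER embeddings with the publicly released allenai/specter checkpoint, following its documented title-plus-abstract input format (the example papers below are invented):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

# SPECTER embeds each paper as "title [SEP] abstract".
papers = [
    {"title": "An invented paper title", "abstract": "A placeholder abstract."},
    {"title": "Another invented title", "abstract": "Another placeholder abstract."},
]
text = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
inputs = tokenizer(text, padding=True, truncation=True,
                   return_tensors="pt", max_length=512)
result = model(**inputs)
embeddings = result.last_hidden_state[:, 0, :]  # [CLS] embedding per paper
```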
We introduce S2ORC, a large contextual citation graph of English-language academic papers from multiple scientific domains; the corpus consists of 81.1M papers, 380.5M citation edges, and associated paper metadata.
We bring together ideas from cognitive science and AI/NLU, arguing that grounding by analogical inference and executable simulation will greatly benefit NLU systems. We propose a system architecture along with a roadmap towards realizing this vision.
TREC-COVID is a community evaluation designed to build a test collection that captures the information needs of biomedical researchers using the scientific literature during a pandemic.
This article presents a brief description of the rationale and structure of TREC-COVID, a still-ongoing IR evaluation. TREC-COVID is creating a new paradigm for search evaluation in rapidly evolving crisis scenarios.
A novel ranking approach, consisting of textual and ontological overlaps between the preliminary and final versions of reports, is proposed; it allows medical practitioners to easily identify and learn from the reports in which their interpretation most substantially differed from that of the attending physician.
We introduce the Longformer, with an attention mechanism that scales linearly with sequence length, achieving state-of-the-art results on multiple character-level language modeling and document-level tasks.
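The released checkpoints are usable through the transformers library; a minimal sketch with the public base checkpoint allenai/longformer-base-4096 (the input text is a stand-in for a real long document):

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# A long input; repetition stands in for a full document.
text = " ".join(["Scientific documents can run to thousands of tokens."] * 150)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Local windowed attention everywhere, plus global attention on the first
# ([CLS]) token, the standard pattern for classification-style tasks.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
doc_repr = outputs.last_hidden_state[:, 0, :]  # document representation
```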
We present a model based on pretrained language models for classifying sentences in the context of other sentences. It achieves SOTA results on 4 datasets across 2 different domains. We also release a challenging dataset of 2K discourse facets in the CS domain.
The approach extends BERT by masking contiguous random spans, rather than random tokens, and training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it.
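A hedged sketch of the span-masking input corruption (our own minimal implementation, not the authors' code): the geometric span-length distribution with p=0.2 clipped at length 10 follows the paper's settings, while the rest is illustrative, and the real objective additionally trains span-boundary representations to predict the masked content.

```python
import random

def sample_span_length(p=0.2, max_len=10):
    """Draw a span length from a clipped geometric distribution."""
    length = 1
    while random.random() > p and length < max_len:
        length += 1
    return length

def mask_contiguous_spans(tokens, mask_token="[MASK]", mask_ratio=0.15):
    """Mask whole contiguous spans rather than independent random tokens."""
    tokens = list(tokens)
    budget = max(1, int(len(tokens) * mask_ratio))
    masked = set()
    attempts = 0
    while budget > 0 and attempts < 100:
        attempts += 1
        span_len = min(sample_span_length(), budget)
        start = random.randrange(len(tokens) - span_len + 1)
        span = range(start, start + span_len)
        if any(i in masked for i in span):
            continue  # avoid overlapping an already-masked span
        for i in span:
            tokens[i] = mask_token
            masked.add(i)
        budget -= span_len
    return tokens, sorted(masked)

corrupted, positions = mask_contiguous_spans("the model predicts each masked span from its boundaries".split())
```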
SciBERT is a pretrained language model for scientific text.
This paper describes the basic elements of GrapAL and how to use it, along with several use cases such as finding experts on a given topic for peer reviewing, discovering indirect connections between biomedical entities, and computing citation-based metrics.