Construction of the Literature Graph in Semantic Scholar
TLDR
This paper reduces literature graph construction to familiar NLP tasks, points out research challenges due to differences from standard formulations of these tasks, and reports empirical results for each task.
Content-Based Citation Recommendation
TLDR
It is shown empirically that, although adding metadata improves performance on standard metrics, it favors self-citations, which are less useful in a citation recommendation setup; an online portal for citation recommendation based on this method is also released.
Part-of-speech histograms for genre classification of text
TLDR
This work proposes statistics of POS histograms as classification features, coupled with a quadratic discriminant classifier, to address the problem of classifying the genre of text, which is useful for a variety of language processing problems.
Completely Lazy Learning
TLDR
This work proposes a simple alternative to cross validation of the neighborhood size that requires no preprocessing: instead of committing to one neighborhood size, average the discriminants for multiple neighborhoods so that similar classification performance can be attained without any training.
SPECTER: Document-level Representation Learning using Citation-informed Transformers
TLDR
This work proposes SPECTER, a new method to generate document-level embedding of scientific papers based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph, and shows that SPECTER outperforms a variety of competitive baselines on the benchmark.
Multi-Task Averaging
TLDR
Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that the approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient.
Classifying Factored Genres with Part-of-Speech Histograms
This work addresses the problem of genre classification of text and speech transcripts, with the goal of handling genres not seen in training. Two frameworks employing different statistics on ...
Revisiting Stein's paradox: multi-task averaging
TLDR
The proposed multi-task averaging (MTA) algorithm results in a convex combination of the individual tasks' sample averages, and the optimal amount of regularization for the two-task case is derived for the minimum risk estimator and a minimax estimator.
Quantifying Sex Bias in Clinical Studies at Scale With Automated Data Extraction
TLDR
It is suggested that sex bias against female participants in clinical studies persists, but results differ when studies vs. participants are the measurement units.
Precursor charge state prediction for electron transfer dissociation tandem mass spectra.
TLDR
An ETD charge state prediction tool based on support vector machine classifiers is demonstrated to exhibit superior classification accuracy while minimizing the overall number of predicted charge states.