• Publications
Multiword Expressions: A Pain in the Neck for NLP
Multiword expressions are a key problem for the development of large-scale, linguistically sound natural language processing technology. This paper surveys the problem and some currently available…
  • 1,047
  • 116
Automatic Evaluation of Topic Coherence
This paper introduces the novel task of topic coherence evaluation, whereby a set of words, as generated by a topic model, is rated for coherence or interpretability. We apply a range of topic…
  • 608
  • 61
SemEval-2010 Task 5 : Automatic Keyphrase Extraction from Scientific Articles
This paper describes Task 5 of the Workshop on Semantic Evaluation 2010 (SemEval-2010). Systems are to automatically assign keyphrases or keywords to given scientific articles. The participating…
  • 286
  • 58
Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality
Topic models based on latent Dirichlet allocation and related methods are used in a range of user-focused tasks including document navigation and trend analysis, but evaluation of the intrinsic…
  • 288
  • 50
Text-Based Twitter User Geolocation Prediction
Geographical location is vital to geospatial applications like local search and event detection. In this paper, we investigate and improve on the task of text-based geolocation prediction of Twitter…
  • 239
  • 37
langid.py: An Off-the-shelf Language Identification Tool
We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets, and 2…
  • 433
  • 34
Shared Tasks of the 2015 Workshop on Noisy User-generated Text: Twitter Lexical Normalization and Named Entity Recognition
This paper presents the results of the two shared tasks associated with W-NUT 2015: (1) a text normalization task with 10 participants; and (2) a named entity tagging task with 8 participants. We…
  • 137
  • 34
Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and…
  • 480
  • 33
An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
Recently, Le and Mikolov (2014) proposed doc2vec as an extension to word2vec (Mikolov et al., 2013a) to learn document-level embeddings. Despite promising results in the original paper, others have…
  • 331
  • 31
An Empirical Model of Multiword Expression Decomposability
This paper presents a construction-inspecific model of multiword expression decomposability based on latent semantic analysis. We use latent semantic analysis to determine the similarity between a…
  • 240
  • 26