• Publications
Multiword Expressions: A Pain in the Neck for NLP
TLDR
The various kinds of multiword expressions should be analyzed in distinct ways, including listing "words with spaces", hierarchically organized lexicons, restricted combinatoric rules, lexical selection, "idiomatic constructions" and simple statistical affinity.
Automatic Evaluation of Topic Coherence
TLDR
A simple co-occurrence measure based on pointwise mutual information over Wikipedia data achieves results for the task at or near the level of inter-annotator correlation, and other Wikipedia-based lexical relatedness methods also achieve strong results.
SemEval-2010 Task 5 : Automatic Keyphrase Extraction from Scientific Articles
TLDR
The participating systems were evaluated by matching their extracted keyphrases against manually assigned ones, and the overall ranking of the submitted systems is presented.
Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality
TLDR
This work explores the two tasks of automatic evaluation of single topics and automatic evaluation of whole topic models, and provides recommendations on the best strategy for performing the two tasks, in addition to providing an open-source toolkit for topic and topic model evaluation.
langid.py: An Off-the-shelf Language Identification Tool
TLDR
It is found that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.
An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
TLDR
It is found that doc2vec performs robustly when using models trained on large external corpora, and can be further improved by using pre-trained word embeddings.
Text-Based Twitter User Geolocation Prediction
TLDR
This paper presents an integrated geolocation prediction framework, evaluates the impact of nongeotagged tweets, language, and user-declared metadata on geolocation prediction, and discusses how users differ in terms of their geolocatability.
Shared Tasks of the 2015 Workshop on Noisy User-generated Text: Twitter Lexical Normalization and Named Entity Recognition
TLDR
The task, annotation process and dataset statistics are outlined, and a high-level overview of the participating systems for each shared task is provided.
Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
TLDR
This paper targets out-of-vocabulary words in short text messages and proposes a method for identifying and normalising ill-formed words, which achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.
An Empirical Model of Multiword Expression Decomposability
TLDR
A construction-inspecific model of multiword expression decomposability based on latent semantic analysis is presented, and evidence is furnished for the calculated similarities being correlated with the semantic relational content of WordNet.