Multiword Expressions: A Pain in the Neck for NLP
- I. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, D. Flickinger
- Linguistics
- Conference on Intelligent Text Processing and…
- 17 February 2002
The various kinds of multiword expressions should be analyzed in distinct ways, including listing "words with spaces", hierarchically organized lexicons, restricted combinatoric rules, lexical selection, "idiomatic constructions" and simple statistical affinity.
Automatic Evaluation of Topic Coherence
- D. Newman, Jey Han Lau, Karl Grieser, Timothy Baldwin
- Computer Science
- North American Chapter of the Association for…
- 2 June 2010
A simple co-occurrence measure based on pointwise mutual information over Wikipedia data achieves results for the task at or near the level of inter-annotator correlation; other Wikipedia-based lexical relatedness methods also achieve strong results.
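As a rough illustration of a PMI-based coherence score, the sketch below averages pairwise PMI over a topic's top words, where PMI(w1, w2) = log(p(w1, w2) / (p(w1) p(w2))). The function name `pmi_coherence`, the use of document-level co-occurrence, and the choice to skip pairs that never co-occur are assumptions made for this example; the summary above does not specify the paper's counting window or smoothing.

```python
import math
from itertools import combinations


def pmi_coherence(topic_words, documents):
    """Mean pairwise PMI of a topic's words (hypothetical helper).

    Probabilities are estimated as the fraction of reference documents
    that contain a word (or both words of a pair). Document-level
    counting is an assumption of this sketch.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p1 == 0 or p2 == 0 or p12 == 0:
            continue  # assumption: skip pairs never seen together
        scores.append(math.log(p12 / (p1 * p2)))
    return sum(scores) / len(scores) if scores else 0.0


# Toy reference corpus: each document is a list of tokens.
docs = [["a", "b"], ["a", "b"], ["c"], ["c", "d"]]
score = pmi_coherence(["a", "b"], docs)
```

A higher score means the topic's words co-occur more often in the reference corpus than chance would predict.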
Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality
- Jey Han Lau, D. Newman, Timothy Baldwin
- Computer Science
- Conference of the European Chapter of the…
- 1 April 2014
This work explores the two tasks of automatic evaluation of single topics and automatic evaluation of whole topic models, and provides recommendations on the best strategy for performing the two tasks, in addition to providing an open-source toolkit for topic and topic model evaluation.
SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles
- Su Nam Kim, Olena Medelyan, Min-Yen Kan, Timothy Baldwin
- Computer Science
- International Workshop on Semantic Evaluation
- 15 July 2010
The participating systems were evaluated by matching their extracted keyphrases against manually assigned ones and the overall ranking of the submitted systems is presented.
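Matching extracted keyphrases against manually assigned ones is commonly scored with precision, recall, and F1. The sketch below is one minimal way to do that; the function name `keyphrase_prf` and the exact-match rule after lowercasing and trimming are assumptions for illustration, not the task's official scoring procedure.

```python
def keyphrase_prf(extracted, gold):
    """Precision, recall, and F1 of extracted vs. gold keyphrases.

    Assumption: two keyphrases match when they are identical after
    lowercasing and stripping surrounding whitespace.
    """
    ext = {k.lower().strip() for k in extracted}
    gld = {k.lower().strip() for k in gold}
    matched = len(ext & gld)
    precision = matched / len(ext) if ext else 0.0
    recall = matched / len(gld) if gld else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical system output and gold annotations for one article.
p, r, f1 = keyphrase_prf(
    ["Topic Model", "neural network", "parsing"],
    ["topic model", "parsing", "tagging", "lexicon"],
)
```

Systems can then be ranked by averaging these scores over all test articles.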
An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
It is found that doc2vec performs robustly when using models trained on large external corpora, and can be further improved by using pre-trained word embeddings.
langid.py: An Off-the-shelf Language Identification Tool
It is found that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.
Text-Based Twitter User Geolocation Prediction
This paper presents an integrated geolocation prediction framework, evaluates the impact of non-geotagged tweets, language, and user-declared metadata on geolocation prediction, and discusses how users differ in terms of their geolocatability.
Shared Tasks of the 2015 Workshop on Noisy User-generated Text: Twitter Lexical Normalization and Named Entity Recognition
- Timothy Baldwin, Marie-Catherine de Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, Wei Xu
- Computer Science, Psychology
- NUT@IJCNLP
- 1 August 2015
The task, annotation process and dataset statistics are outlined, and a high-level overview of the participating systems for each shared task is provided.
Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
This paper targets out-of-vocabulary words in short text messages and proposes a method for identifying and normalising ill-formed words, which achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.