Publications
A Closer Look at Skip-gram Modelling
TLDR
The amount of extra training data required to achieve skip-gram coverage using standard adjacent tri-grams is determined by computing all possible skip-grams in a training corpus and measuring how many adjacent (standard) n-grams these cover in test documents.
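A minimal sketch of that coverage measurement, assuming whitespace-tokenised text and the usual definition of a k-skip-n-gram (n tokens in order, with at most k tokens skipped in total); the function names and the choice of n=3, k=2 are illustrative and not taken from the paper.

```python
from itertools import combinations

def skip_grams(tokens, n=3, k=2):
    """Enumerate all k-skip-n-grams: n tokens in order, drawn from a window of n + k positions."""
    grams = set()
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n + k]
        # The first token of the window is fixed; choose the remaining n-1 tokens from it.
        for rest in combinations(range(1, len(window)), n - 1):
            grams.add((window[0],) + tuple(window[i] for i in rest))
    return grams

def adjacent_ngrams(tokens, n=3):
    """Standard (adjacent) n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def coverage(train_tokens, test_tokens, n=3, k=2):
    """Fraction of the test document's adjacent n-grams that appear among
    the training corpus's k-skip-n-grams."""
    train = skip_grams(train_tokens, n, k)
    test = adjacent_ngrams(test_tokens, n)
    return len(test & train) / len(test) if test else 0.0
```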
Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval
We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time.
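The paper's own storage structures are not reconstructed here; as a loose illustration of the lookup pattern only (keying on fixed-width fingerprints of n-grams rather than the strings themselves), the hedged sketch below uses a plain Python dictionary of 64-bit hashes. Every name is hypothetical, and this toy achieves none of the space savings the paper is about; fingerprint collisions can also silently merge counts, a trade-off real compact language-model stores control explicitly.

```python
import hashlib

def fingerprint(ngram):
    """Hash an n-gram (tuple of tokens) to a 64-bit integer fingerprint."""
    digest = hashlib.blake2b(" ".join(ngram).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

class FingerprintCounts:
    """Toy constant-time n-gram count store keyed by fingerprints."""

    def __init__(self):
        self._counts = {}

    def add(self, ngram, count):
        self._counts[fingerprint(ngram)] = count

    def get(self, ngram):
        return self._counts.get(fingerprint(ngram), 0)

store = FingerprintCounts()
store.add(("the", "quick", "brown"), 42)
print(store.get(("the", "quick", "brown")))    # 42
print(store.get(("an", "unseen", "trigram")))  # 0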
Unsupervised Anomaly Detection
TLDR
This paper presents several variants of an automatic technique for identifying an 'unusual' segment within a document, considers texts that are unusual because of author, genre, topic or emotional tone, and shows substantial improvements over a baseline in all cases.
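As one simple variant of the general idea only (not the specific features or scoring used in the paper), the hypothetical sketch below flags the segment whose word-frequency profile is least similar to the profile of the rest of the document.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def most_unusual_segment(segments):
    """Return the index of the segment (a list of tokens) whose word profile
    is least similar to the combined profile of all other segments."""
    profiles = [Counter(seg) for seg in segments]
    scores = []
    for i, prof in enumerate(profiles):
        rest = Counter()
        for j, other in enumerate(profiles):
            if j != i:
                rest.update(other)
        scores.append(cosine(prof, rest))
    return min(range(len(scores)), key=scores.__getitem__)
```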
Towards the Orwellian Nightmare: Separation of Business and Personal Emails
TLDR
This paper describes the largest-scale annotation project involving the Enron email corpus to date, in which emails were classified into the categories "Business" and "Personal" and then sub-categorised by type within these categories.
Integrating Information to Bootstrap Information Extraction from Web Sites
TLDR
A methodology is presented for learning to extract domain-specific information from large repositories (e.g. the Web) with minimal user intervention, bootstrapping learning for simple Information Extraction (IE) methodologies.
Unsupervised detection of anomalous text
TLDR
This thesis describes work on the detection of anomalous material in text without the use of training data, identifies a novel method that performs consistently better than others, and identifies the features that contribute most to unsupervised anomaly detection.
Mining web sites using adaptive information extraction
TLDR
This paper presents a methodology that drastically reduces (or even removes) the amount of manual annotation required when annotating consistent sets of pages.
Another Look at the Data Sparsity Problem
TLDR
The paper shows that large portions of language will not be represented within even very large corpora, and confirms that more data is always better; how much better, however, depends on a range of factors: the source of that additional data, the sources of the test documents, and how the language model is pruned to account for sampling errors and make computation reasonable.
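A toy illustration of the coverage measurement behind this claim, with made-up example text; the function name and parameters are my own, not the paper's.

```python
def unseen_rate(train_tokens, test_tokens, n=3):
    """Fraction of the test document's adjacent n-grams never seen in the training data."""
    train = {tuple(train_tokens[i:i + n]) for i in range(len(train_tokens) - n + 1)}
    test = [tuple(test_tokens[i:i + n]) for i in range(len(test_tokens) - n + 1)]
    return sum(1 for g in test if g not in train) / len(test) if test else 0.0

# Even when every training trigram is memorised, test text from a different
# source still contains trigrams the training sample never covered.
train = "the cat sat on the mat and the dog sat on the rug".split()
test = "the cat sat quietly on the new rug".split()
print(unseen_rate(train, test, n=3))  # 5 of 6 test trigrams are unseen
```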
Mining Web Sites Using Unsupervised Adaptive Information Extraction
TLDR
This paper presents a methodology that drastically reduces the amount of manual annotation required when annotating consistent sets of pages, using an application of IE from Computer Science Web sites as an exemplification.
Methods for Collection and Evaluation of Comparable Documents
TLDR
This chapter describes work on developing methods for automatically gathering comparable corpora from the Web, specifically for under-resourced languages, and develops an evaluation method to assess the quality of the retrieved documents.