A Closer Look at Skip-gram Modelling
TLDR
The amount of extra training data required to achieve skip-gram coverage using standard adjacent tri-grams is determined by computing all possible skip-grams in a training corpus and measuring how many adjacent (standard) n-grams these cover in test documents.
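The coverage computation the summary describes can be sketched as follows. This is a minimal illustration, not the paper's code: it extracts k-skip-n-grams from a training text and measures what fraction of the standard adjacent n-grams in a test text they cover.

```python
from itertools import combinations

def skip_grams(tokens, n, k):
    """All n-grams allowing up to k skipped tokens between the first
    and last word of the gram (k-skip-n-grams)."""
    grams = set()
    for i in range(len(tokens)):
        window = tokens[i:i + n + k]
        for idxs in combinations(range(1, len(window)), n - 1):
            grams.add((window[0],) + tuple(window[j] for j in idxs))
    return grams

def adjacent_ngrams(tokens, n):
    """Standard contiguous n-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

train = "the quick brown fox jumps over the lazy dog".split()
test = "the brown fox jumps the dog".split()

seen = skip_grams(train, n=2, k=2)      # 2-skip-bi-grams from training text
needed = adjacent_ngrams(test, n=2)     # standard bigrams in the test text
coverage = len(needed & seen) / len(needed)
```

Because skip-grams relax adjacency, a small training text can cover test bigrams (such as "the brown" above) that never occur contiguously in training.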
Unsupervised Anomaly Detection
TLDR
This paper shows several variants of an automatic technique for identifying an 'unusual' segment within a document, and considers texts which are unusual because of author, genre, topic or emotional tone, and shows substantial improvements over a baseline in all cases.
Towards the Orwellian Nightmare: Separation of Business and Personal Emails
TLDR
This paper describes the largest scale annotation project involving the Enron email corpus to date, which classified emails into the categories "Business" and "Personal", and then sub-categorised by type within these categories.
Authorship Attribution of E-Mail: Comparing Classifiers over a New Corpus for Evaluation
TLDR
A newly created subcorpus of the Enron emails is described, which is suggested for use in testing techniques for authorship attribution; three different classification methods are applied to this task to provide baseline results.
An Improved Hierarchical Bayesian Model of Language for Document Classification
TLDR
In the course of the paper, an approximate sampling distribution for word counts in documents is advocated, and the model's capacity to outperform both the simple multinomial and more recently proposed extensions on the classification task is demonstrated.
Sentiment Detection Using Lexically-Based Classifiers
TLDR
It is concluded that classifier choice plays at least as important a role as feature choice, and that in many cases word-based classifiers perform well on the sentiment detection task.
Another Look at the Data Sparsity Problem
TLDR
The paper shows that large portions of language will not be represented within even very large corpora. It confirms that more data is always better, but how much better depends upon several factors: the source of the additional data, the sources of the test documents, and how the language model is pruned to account for sampling errors and keep computation tractable.
Chinese Text Classification without Automatic Word Segmentation
TLDR
This paper tests the assumption that segmentation is a necessary step for authorship attribution and topic classification tasks in Chinese, and demonstrates that it is not, and shows that a naïve character bigram model of text performs as well as models generated using a state-of-the-art automatic segmenter.
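The segmentation-free representation described above can be sketched as follows. This is a hypothetical illustration of a character bigram feature extractor, not the paper's pipeline: every contiguous pair of characters becomes a feature, so no word segmenter is needed.

```python
from collections import Counter

def char_bigrams(text):
    """Character bigram counts for a text.

    No word segmentation is performed: each contiguous pair of
    characters (whitespace removed) is treated as one feature.
    """
    chars = [c for c in text if not c.isspace()]
    return Counter(zip(chars, chars[1:]))

doc = "我爱自然语言处理"
features = char_bigrams(doc)
# each bigram, e.g. ('自', '然'), becomes one feature dimension
# that a classifier can weight directly
```

Feature vectors built this way can be fed to any standard classifier for authorship attribution or topic classification.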
Using a Probabilistic Model of Context to Detect Word Obfuscation
TLDR
A distributional model of word use and word meaning which is derived purely from a body of text is proposed, and then a measure of semantic relatedness between a word and its context is defined using the same model.
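A distributional model of the kind summarised above can be sketched as follows. This is a minimal illustration under simple assumptions (fixed co-occurrence window, cosine similarity), not the paper's exact model: word vectors are built purely from co-occurrence counts in text, and relatedness between a word and its context is the cosine of their vectors.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build a context vector per word: counts of the other words
    appearing within a fixed window of it."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vecs[w][sent[j]] += 1
    return vecs

def relatedness(vecs, a, b):
    """Cosine similarity between the context vectors of two words."""
    va, vb = vecs[a], vecs[b]
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["a", "dog", "sat", "on", "a", "rug"]]
vecs = cooccurrence_vectors(sentences)
score = relatedness(vecs, "cat", "dog")  # high: similar contexts
```

A word whose relatedness to its surrounding context is unusually low, relative to the alternatives the model expects there, can then be flagged as a possible obfuscation.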