Another Look at the Data Sparsity Problem

Ben Allison, David Guthrie, Louise Guthrie
Performance on a statistical language processing task relies upon accurate information being found in a corpus. However, it is known (and this paper will confirm) that many perfectly valid word sequences do not appear in training corpora. The percentage of n-grams in a test document which are seen in a training corpus is defined as n-gram coverage, and work in the speech processing community [7] has shown that there is a correlation between n-gram coverage and word error rate (WER) on a speech…
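The definition of n-gram coverage above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: whitespace tokenization, counting every n-gram occurrence in the test text (tokens rather than distinct types), and the hypothetical helper names `ngrams` and `ngram_coverage`. The paper's own counting conventions may differ.

```python
def ngrams(tokens, n):
    """Return every n-gram (as a tuple) in a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_coverage(train_tokens, test_tokens, n):
    """Percentage of test n-gram occurrences that also appear in training.

    Assumption: coverage is computed over n-gram tokens (occurrences),
    not over distinct n-gram types.
    """
    seen = set(ngrams(train_tokens, n))
    test = ngrams(test_tokens, n)
    if not test:
        # A test text shorter than n has no n-grams to cover.
        return 0.0
    hits = sum(1 for gram in test if gram in seen)
    return 100.0 * hits / len(test)


# Toy data, invented purely for illustration.
train = "the cat sat on the mat".split()
test = "the cat sat on the sofa".split()

for n in (1, 2, 3):
    print(n, round(ngram_coverage(train, test, n), 2))
```

On this toy pair, coverage falls as n grows: a single unseen word ("sofa") removes one unigram but also every longer n-gram that contains it, which is the sparsity effect the abstract describes.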
Structured Language Modeling for Automatic Speech Recognition
Probabilistic lexicalized tree-insertion grammars (PLTIGs) are evaluated on a classification task relevant for automatic speech recognition. The baseline is a family of n-gram models tuned with
An Improved Hierarchical Word Sequence Language Model Using Word Association
The basic HWS approach is improved upon by generalizing it to exploit not only word frequencies but word association, and both intrinsic and extrinsic experiments verify that word association based HWS models can achieve better performance.
The effect of word similarity on N-gram language models in Northern and Southern Dutch
This paper examines several combinations of classical N-gram language models with more advanced and well-known techniques based on word similarity, such as cache models and Latent Semantic Analysis, and finds that a linear interpolation of a 3-gram, a cache model and a continuous skip-gram is capable of reducing perplexity by up to 18.63% compared to a 3-gram baseline.
Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information
A new context-sensitive spelling correction method for detecting and correcting non-word and real-word errors in digital text documents, based on statistics from the Google Web 1T 5-gram data set, which consists of a large volume of n-gram word sequences extracted from the World Wide Web.
Noise Robust Automatic Speech Recognition Based on Spectro-Temporal Techniques
This study examines many methods to improve the robustness of automatic speech recognition based on human speech perception and demonstrates that they can provide results that are competitive with the state-of-the-art results.
Parallel Spell-Checking Algorithm Based on Yahoo! N-Grams Dataset
A new parallel shared-memory spell-checking algorithm is proposed that uses rich real-world word statistics from the Yahoo! N-Grams Dataset to correct non-word and real-word errors in computer text.
Probabilistic Lexicalized Tree-Insertion Grammars in Automatic Speech Recognition
This work evaluates probabilistic lexicalized tree-insertion grammars (PLTIGs) on a classification task relevant for automatic speech recognition, and finds that the N-gram model preferred one of these alternative sentences in 43.1 percent of the cases, while the PLTIG was only mistaken in 3 percent.
Common Topic Identification in Online Maltese News Portal Comments
The results obtained indicate that the majority of comments follow a political theme related either to party politics, foreign politics, corruption, issues of an ideological nature, or other issues.
Towards Analysing the Sentiments in the Field of Automobile with Specific Focus on Arabic Language Text
This research attempts to cover gaps in sentiment analysis on automobile reviews in Gulf Cooperation Council (GCC) dialects and in the Arabic language in general by analyzing the sentiments in Arabic regional dialects using annotated datasets.


Optimizing lexical and N-gram coverage via judicious use of linguistic data
I study the effect of various types and amounts of North American Business language data on the quality of the derived vocabulary, and use my findings to derive an improved ranking of the words,
Mitigating the Paucity-of-Data Problem: Exploring the Effect of Training Corpus Size on Classifier Performance for Natural Language Processing
In this paper, we discuss experiments applying machine learning techniques to the task of confusion set disambiguation, using three orders of magnitude more training data than has previously been
Introducing the Enron Corpus
The goal in this paper is to analyze the suitability of this corpus for exploring how to classify messages as organized by a human, so these folders would have likely been misleading.
An empirical study of smoothing techniques for language modeling
A survey of the most widely-used algorithms for smoothing models for language n-gram modeling and an extensive empirical comparison of several of these smoothing techniques are presented.
Foundations of Statistical Natural Language Processing
  • P. Kantor
  • Information Retrieval
  • 2004
Mitigating the Paucity of Data Problem
There’s No Data Like More Data (But When Will Enough Be Enough?)
  • In Proceedings of IEEE International Workshop on Intelligent Signal Processing
  • 2001
Up from trigrams
  • In Proceedings Eurospeech
  • 1991
The Anarchist’s Cookbook
  • Ozark Pr Llc
  • 1970