Comparing neural‐ and N‐gram‐based language models for word segmentation

@article{Doval2018ComparingNA,
  title={Comparing neural‐ and N‐gram‐based language models for word segmentation},
  author={Yerai Doval and Carlos G{\'o}mez-Rodr{\'i}guez},
  journal={Journal of the Association for Information Science and Technology},
  year={2018},
  volume={70},
  pages={187--197}
}
Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n‐gram model or a recurrent neural network. The resulting system analyzes the text input with no word boundaries one token at a time, which can be a character… 
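The combination of a beam search with a character-level language model can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes an input with no spaces at all, handles only the insertion of boundaries (not deletion), and swaps in an add-one-smoothed character bigram model trained on a toy corpus for the n-gram or recurrent models the abstract mentions. The names `train_bigram_lm`, `log_prob`, `segment`, and `beam_width` are hypothetical.

```python
import math
from collections import Counter

START = "^"  # hypothetical start-of-text marker, assumed absent from the data


def train_bigram_lm(corpus):
    """Count character bigrams (spaces included) over training lines."""
    pair_counts = Counter()
    context_counts = Counter()
    vocab = {START}
    for line in corpus:
        padded = START + line
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, len(vocab)


def log_prob(lm, prev, cur):
    """Add-one-smoothed log P(cur | prev)."""
    pair_counts, context_counts, vocab_size = lm
    return math.log((pair_counts[(prev, cur)] + 1)
                    / (context_counts[prev] + vocab_size))


def segment(text, lm, beam_width=8):
    """Read `text` one character at a time, keeping the best hypotheses."""
    beams = [(START, 0.0)]  # (output so far, cumulative log-probability)
    for ch in text:
        candidates = []
        for out, score in beams:
            # Hypothesis 1: no boundary before this character.
            candidates.append((out + ch, score + log_prob(lm, out[-1], ch)))
            # Hypothesis 2: insert a boundary first (never at the very start).
            if len(out) > 1:
                with_space = (score
                              + log_prob(lm, out[-1], " ")
                              + log_prob(lm, " ", ch))
                candidates.append((out + " " + ch, with_space))
        # Prune to the `beam_width` highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0][len(START):]


lm = train_bigram_lm(["the cat sat on the mat", "the dog sat on the cat"])
print(segment("thecatsat", lm))
```

A bigram model sees only one character of context, so the boundaries it proposes on real text are likely to be poor. The sketch is meant to show the search procedure, in which the language model is a replaceable scoring component.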

MiNgMatch - A Fast N-gram Model for Word Segmentation of the Ainu Language

The MiNgMatch Segmenter is introduced: a fast word segmentation algorithm that reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text.

An Empirical Study on Efficiency of a Dictionary Based Viterbi Algorithm for Word Segmentation

An algorithm for segmenting English sentences written without spaces into their constituent words, based on a dictionary and a variation of the Viterbi algorithm called the Reverse Sequence Search (RSS) algorithm, which runs in O(n) time and space.

Language Modelling for a Low-Resource Language in Sarawak, Malaysia

This paper explores state-of-the-art techniques for creating language models in a low-resource setting by studying current language modelling techniques, such as n-gram and recurrent neural network models, and observing their outcomes on data from a language in Sarawak, Malaysia.

An Efficient Minimal Text Segmentation Method for URL Domain Names

An efficient minimal text segmentation (EMTS) method for URL domain names to achieve efficient adaptive text mining and a novel method of integrating conflict game into the two-directional maximum matching algorithm to enhance the accuracy of recognition.

Towards robust word embeddings for noisy texts

This work proposes a simple extension to the skipgram model that introduces the concept of bridge-words: artificial words added to the model to strengthen the similarity between standard words and their noisy variants.

Hashtag Segmentation: A Comparative Study Involving the Viterbi, Triangular Matrix and Word Breaker Algorithms

The Word Breaker algorithm, which can ascertain the meaningful tokens in the form of words, before proceeding with the segmentation of the remaining characters, is considered superior to both the Viterbi and Triangular Matrix algorithms, particularly when it comes to the detection of unknown words.

Tokenization Repair in the Presence of Spelling Errors

This work identifies three key ingredients of high-quality tokenization repair, all missing from previous work: deep language models with a bidirectional component, training the models on text with spelling errors, and making use of the space information already present.

Data Augmentation Methods for Low-Resource Orthographic Syllabification

A new transposing nuclei-based augmentation method is proposed and combined with both flipping and swapping procedures to tackle the drawback of bigram syllabification in handling the OOV bigrams.

References


Character-Aware Neural Language Models

A simple neural language model that relies only on character-level inputs is able to encode, from characters only, both semantic and orthographic information, which suggests that for many languages character inputs are sufficient for language modeling.

Neural Word Segmentation Learning for Chinese

A novel neural framework is proposed which thoroughly eliminates context windows and can utilize the complete segmentation history; it employs a gated combination neural network over characters to produce distributed representations of word candidates, which are then given to a long short-term memory (LSTM) language scoring model.

Word Segmentation In Sentence Analysis

A model of language processing in which word segmentation is an integral part of sentence analysis; it is shown that the use of a parser can enable us to achieve the best ambiguity resolution in word segmentation.

Robust Segmentation of Japanese Text into a Lattice for Parsing

A segmentation component that utilizes minimal syntactic knowledge to produce a lattice of word candidates for a broad coverage Japanese NL parser that achieves a breaking accuracy of ~97% over a wide variety of corpora.

Deep Learning for Chinese Word Segmentation and POS Tagging

This study explores the feasibility of performing Chinese word segmentation and POS tagging by deep learning, and describes a perceptron-style algorithm for training the neural networks as an alternative to the maximum-likelihood method, to speed up the training process and make the learning algorithm easier to implement.

Word segmentation and recognition for web document framework

A maximal bi-directional matching algorithm with heuristic rules is used to resolve ambiguous segmentation and meaning in compound words and an adaptive training process is employed to build a dictionary of recognisable abbreviations and acronyms.

Gated Recursive Neural Network for Chinese Word Segmentation

A gated recursive neural network (GRNN) for Chinese word segmentation is proposed, which contains reset and update gates to incorporate the complicated combinations of the context characters.

Context dependent recurrent neural network language model

This paper improves the performance of recurrent neural network language models by providing, in association with each word, a real-valued input vector that conveys contextual information about the sentence being modeled; the vector is obtained by performing Latent Dirichlet Allocation on a block of preceding text.

Long Short-Term Memory Neural Networks for Chinese Word Segmentation

A novel neural network model for Chinese word segmentation is proposed, which adopts the long short-term memory (LSTM) neural network to keep previous important information in the memory cell and avoids the window-size limit of local context.

LSTM Neural Networks for Language Modeling

This work analyzes the Long Short-Term Memory neural network architecture on an English and a large French language modeling task and gains considerable improvements in WER on top of a state-of-the-art speech recognition system.