VSEC: Transformer-based Model for Vietnamese Spelling Correction

Dinh-Truong Do, Nguyen Ha Thanh, Thang Bui, Dinh-Hieu Vo
Spelling error correction is one of the topics with a long history in natural language processing. Although previous studies have achieved remarkable results, challenges remain. In Vietnamese, a state-of-the-art method for the task infers a syllable’s context from its adjacent syllables. The method’s accuracy can be unsatisfactory, however, because the model may lose the context if two or more spelling mistakes stand near each other. In this paper, we propose a novel…

Understanding Tieq Viet with Deep Learning Models

A linguistic study called Tieq Viet, which was controversial among both researchers and society, is found to be a great example to demonstrate the ability of deep learning models to recover lost information.



Deep Learning Approach for Vietnamese Consonant Misspell Correction

A deep learning approach focusing on consonant misspelling errors is proposed, with accuracy significantly higher than that of the current state-of-the-art model.

Attention is All you Need

A new, simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

OCR Error Correction for Unconstrained Vietnamese Handwritten Text

An automatic OCR post-processing model comprising both error detection and error correction phases for OCR output texts of unconstrained Vietnamese handwriting is presented, with results that outperform those obtained by various recognition systems in the VOHTR2018 competition.

Normalization of Vietnamese Tweets on Twitter

A method that aims to normalize Vietnamese tweets by detecting non-standard words as well as spelling errors and correcting them is proposed, which combines a language model with dictionaries and Vietnamese vocabulary structures.

A spelling correction method and its application to an OCR system

Character confusion versus focus word-based correction of spelling and OCR variants in corpora

  • Martin Reynaert
  • Computer Science
    International Journal on Document Analysis and Recognition (IJDAR)
  • 2010
The character confusion-based prototype of Text-Induced Corpus Clean-up is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents, showing that the system is not sensitive to domain variation.

OCRSpell: an interactive spelling correction system for OCR errors in text

A spelling correction system designed specifically for OCR-generated text that selects candidate words through the use of information gathered from multiple knowledge sources is described, based on static and dynamic device mappings, approximate string matching, and n-gram analysis.

Context-Dependent Sequence-to-Sequence Turkish Spelling Correction

  • Osman Büyük
  • Computer Science
    ACM Trans. Asian Low Resour. Lang. Inf. Process.
  • 2020
This article uses sequence-to-sequence (seq2seq) models for spelling correction in the agglutinative Turkish language to improve the baseline performance and observes that the proposed context-dependent model performs significantly better than the baseline system.

Non-words Spell Corrector of Social Media Data in Message Filtering Systems

We develop an extended spell checker and corrector for non-word errors in social media datasets, to be used in message filtering systems, especially for cyberbullying

Probability scoring for spelling correction

This paper describes a new program, CORRECT, which takes words rejected by the Unix® SPELL program, proposes a list of candidate corrections, and sorts them by probability score, and finds that human judges were extremely reluctant to cast a vote given only the information available to the program.
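
The idea of proposing candidate corrections for a rejected word and sorting them by a probability score can be illustrated with a minimal sketch. This is not the CORRECT program's scoring model, which the summary above does not specify; it swaps in a plain unigram-frequency score over a toy word list. The names `edits1`, `rank_candidates`, and `TOY_COUNTS` are hypothetical and exist only for this illustration.

```python
from collections import Counter
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string one edit (delete, transpose, replace, insert) from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + replaces + inserts)


def rank_candidates(word, counts):
    """Rank known words within one edit of `word` by relative frequency.

    Assumption: the score is a unigram probability estimated from `counts`;
    a real system would also model how likely each typo is.
    """
    total = sum(counts.values())
    candidates = {c for c in edits1(word) if c in counts}
    scored = [(c, counts[c] / total) for c in candidates]
    # Highest probability first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy frequency table (hypothetical numbers, for illustration only).
TOY_COUNTS = Counter({"the": 50, "then": 10, "they": 8, "tea": 2})

ranked = rank_candidates("thea", TOY_COUNTS)
for candidate, score in ranked:
    print(f"{candidate}\t{score:.3f}")
```

Here the misspelling `thea` is one edit away from all four words in the toy table, so the ranking is decided entirely by frequency.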