• Corpus ID: 59783989

Thai Part-of-speech Tagged Corpus: ORCHID

Virach Sornlertlamvanich, Nobuo Takahashi, Hitoshi Isahara
This paper presents a procedure for building a Thai part-of-speech (POS) tagged corpus, called the ORCHID corpus. It is a collaborative project between the Communications Research Laboratory (CRL) of Japan and the National Electronics and Computer Technology Center (NECTEC) of Thailand, supported by the Electrotechnical Laboratory (ETL) of Japan. We propose a new tagset, based on previous research on Thai parts of speech, for use in a multi-lingual machine translation project. We mark the corpus in three…


Building a Thai part-of-speech tagged corpus (ORCHID)

A new tagset is proposed, based on the results of a prior multilingual machine translation project, for a Thai part-of-speech (POS) tagged corpus, which is a preliminary stage in the construction of a Thai speech corpus.

Phonetically Distributed Continuous Speech Corpus for Thai Language

This paper proposes a work on phonetically balanced sentence (PB) and phonetically distributed sentence (PD) set, which are parts of the text prompt for speech recording in Large Vocabulary

Automatic Annotation Inconsistency Detection: an n-Gram-based Approach

A method to detect potential annotation inconsistencies in monolingual corpora and to incorporate corpora with different versions of part-of-speech tag sets, by automatically providing a list of potential inconsistencies.

ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation

The experimental results demonstrate that applying the LM always leads to a performance gain, especially when the amount of labeled data is small, and ThaiLMCut can outperform other open source state-of-the-art models achieving an F1 Score of 98.78% on the standard benchmark, InterBEST2009.

Speech Technology and Corpus Development in Thailand

This paper describes recent activities in speech technology and corpus development in Thailand, where many speech corpus projects have been launched this year; a number of speech-technology research efforts are also discussed.

Toward benchmarking a general-domain Thai LVCSR System

This paper reports a set of experiments conducted as an initial attempt to benchmark the performance of a general-domain Thai LVCSR system using the LOTUS speech corpus, and finds that using additional data from a large text corpus helps improve the recognition performance of the LVCSR system.

Implementation and evaluation of an HMM-based Thai speech synthesis system

The evaluation of the synthesized speech shows that tone correctness is significantly improved in some clustering styles, and that the implemented system gives better reproduction of prosody (or naturalness, in some sense) than the unit-selection-based system with the same speech database.

Design of tree-based context clustering for an HMM-based Thai speech synthesis system

The evaluation of syllable duration distortion shows that the constancy-based-tone-separated and the trend-based-tone-separated tree structures can alleviate the distortions that appear when using the simple tone-separation tree structure.

Thai Speech Recognition Corpora

The speech corpora developed for Thai speech recognition (ORCHID-SPEECH CORPUS and NECTEC-ATR Thai speech corpus) are described, along with how they are built to preserve important properties: consistency, balance, and coverage of possible phoneme combinations.

Implementing Thai text-to-speech synthesis for hand-held devices

Experimental results show that computational requirements can be reduced by shortening the synthesis filter impulse response and reducing the dimension of the feature vectors to some degree without sacrificing the synthetic speech quality.



Building a large Thai text corpus - part of speech tagged corpus: ORCHID

A new tagset, based on previous research on Thai parts of speech for use in a multi-lingual machine translation project, is proposed for the ORCHID corpus, with a probabilistic trigram model for simultaneous word segmentation and POS tagging.

The Automatic Extraction of Open Compounds from Text Corpora

This paper describes a new method for extracting open compounds (uninterrupted sequences of words) from text corpora of languages such as Thai, Japanese, and Korean that lack explicit word segmentation; the method extracts the strings that experience a significant change in frequency of occurrence when their length is extended.

Classifier Assignment by Corpus-Based Approach

A corpus-based method is proposed that generates Noun Classifier Associations (NCA) to overcome the problems in classifier assignment and semantic construction of noun phrases.

A Stochastic Japanese Morphological Analyzer Using a Forward-DP Backward-A* N-Best Search Algorithm

A novel method for segmenting the input sentence into words and assigning parts of speech to the words and an efficient two-pass N-best search algorithm is presented, suitable for written Japanese.

A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text

A program that tags each word in an input sentence with the most likely part of speech has been written and performance is encouraging; a 400-word sample is presented and is judged to be 99.5% correct.

A Practical Part-of-Speech Tagger

An implementation of a part-of-speech tagger based on a hidden Markov model that enables robust and accurate tagging with few resource requirements; accuracy exceeds 96%.

Thai Dictionary for Multi-lingual Machine Translation System

  • Proceedings of the Regional Workshop on Computer Processing of Asian Language (CPAL)
  • 1989

Profile of International R&D Cooperation Project on Multi-lingual Machine Translation (MMT) System

  • Proceedings of the Symposium on Multi-lingual Machine Translation for Asian Languages, Thailand MMT’95
  • 1995