IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus

@inproceedings{Larasati2012IDENTICCM,
  title={IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus},
  author={Septina Dian Larasati},
  booktitle={LREC},
  year={2012}
}
This paper describes the creation process of an Indonesian-English parallel corpus (IDENTIC). The corpus contains 45,000 sentences collected from different sources in different genres. Several manual text preprocessing tasks, such as alignment and spelling correction, are applied to the corpus to assure its quality. We also apply language specific text processing such as tokenization on both sides and clitic normalization on the Indonesian side. The corpus is available in two different formats…

Citations

Publications citing this paper.

Comparative Study of Smoothing Techniques on Indonesian and English Language Models

  • 2014
CITES BACKGROUND & METHODS
HIGHLY INFLUENCED

Evaluating the use of word embeddings for part-of-speech tagging in Bahasa Indonesia

  • 2016 International Conference on Computer, Control, Informatics and its Applications (IC3INA)
  • 2016

Comparison of Modified Kneser-Ney and Witten-Bell smoothing techniques in statistical language model of Bahasa Indonesia

  • 2014 2nd International Conference on Information and Communication Technology (ICoICT)
  • 2014
CITES BACKGROUND

Designing an Indonesian part of speech tagset and manually tagged Indonesian corpus

  • 2014 International Conference on Asian Language Processing (IALP)
  • 2014
CITES METHODS