Towards the Semantic Text Retrieval for Indonesian

Abstract

Indonesia is the fourth most populous country in the world, and the Asosiasi Penyelenggara Jasa Internet Indonesia (Indonesian Internet Service Providers Association) has recorded that the number of Indonesian Internet subscribers and users grows rapidly every year. These facts should encourage research in areas such as computational linguistics and information retrieval for the Indonesian language, which has not yet been extensively investigated. This research investigates the tolerance rough sets model (TRSM) in order to propose a framework for a semantic text retrieval system. The proposed framework is intended specifically for the Indonesian language; hence, we work with Indonesian corpora and apply tools for Indonesian, e.g. an Indonesian stemmer, in all of the studies. A cognitive approach is employed, particularly during data preparation and analysis. Extensive collaboration with human experts was significant in creating a new Indonesian corpus suitable for our research. The performance of an ad hoc retrieval system serves as the starting point for further analysis, whose purpose is to learn and understand more about the process and characteristics of TRSM rather than to compare TRSM with other methods and determine the best solution. The results of this process serve as guidance for the computational modeling of some of TRSM's tasks and, finally, for the framework of a semantic information retrieval system with TRSM at its heart. In addition to the proposed framework, this thesis proposes three methods based on TRSM: an automatic tolerance value generator, thesaurus optimization, and lexicon-based document representation. All methods were developed using our own corpus, the ICL-corpus, and evaluated on an available Indonesian corpus, the Kompas-corpus. The evaluation of the methods yielded satisfactory results, except for the compact document representation method, which seems to work only in a limited domain.

Cite this paper

@inproceedings{Nguyen2012TowardsTS, title={Towards the Semantic Text Retrieval for Indonesian}, author={Hung Son Nguyen}, year={2012} }