Learn More
This paper, submitted as an entry for the NERSSEAL-2008 shared task, describes a system build for Named Entity Recognition for South and South East Asian Languages. Our paper combines machine learning techniques with language specific heuris-tics to model the problem of NER for In-dian languages. The system has been tested on five languages: Telugu, Hindi,(More)
This paper is an attempt to show that an intermediary level of analysis is an effective way for carrying out various NLP tasks for linguistically similar languages. We describe a process for developing a simple parser for doing such tasks. This parser uses a grammar driven approach to annotate dependency relations (both inter and intra chunk) at an(More)
Part-Of-Speech (POS) tagging is defined as the Natural Language Processing (NLP) task in which each word in a sentence is labeled with a tag indicating its appropriate part of speech. Of the entire supervised machine learning classification algorithms, second order Hidden Markov Model (HMM) and Conditional Random Fields (CRF) is chosen in this work for POS(More)
In this paper we explore various parameter settings of the state-of-art Statistical Machine Translation system to improve the quality of the translation for a 'distant' language pair like English-Hindi. We proposed new techniques for efficient reordering. A slight improvement over the base-line is reported using these techniques. We also show that a simple(More)
In this paper we use the popular phrase-based SMT techniques for the task of machine transliteration, for English-Hindi language pair. Minimum error rate training has been used to learn the model weights. We have achieved an accuracy of 46.3% on the test set. Our results show these techniques can be successfully used for the task of machine transliteration.
Named Entity Recognition(NER) is the task of identifying and classifying tokens in a text document into predefined set of classes. In this paper we show our experiments with various feature combinations for Tel-ugu NER. We also observed that the prefix and suffix information helps a lot in finding the class of the token. We also show the effect of the(More)
This paper describes a machine learning algorithm for Gujarati Part of Speech Tagging. The machine learning part is performed using a CRF model. The features given to CRF are properly chosen keeping the linguistic aspect of Gujarati in mind. As Gujarati is currently a less privileged language in the sense of being resource poor, manually tagged data is only(More)
Text search is a key step in any kind of information access. For doing it effectively, we can use knowledge about the concerned writing systems. Methods based on such knowledge can give significantly better results for searching text, at least for some languages. This can improve information retrieval in particular and information access in general. In this(More)
In this paper, we present five models for sentence realisation from a bag-of-words containing minimal syntactic information. It has a large variety of applications ranging from Machine Translation to Dialogue systems. Our models employ simple and efficient techniques based on n-gram Language modeling. We evaluated the models by comparing the synthesized(More)
Almost all transactions ranging from various domains such as travel, shopping, insurance, entertainment, hotels, appointments etc. are available through Internet based applications. Needless to say, all these applications require the knowledge of English. As Internet users are growing day by day, it is logical to say that, there is a great demand to develop(More)