Learn More
Code-mixing is frequently observed in user generated content on social media, especially from multilingual users. The linguistic complexity of such content is compounded by presence of spelling variations, transliteration and non-adherance to formal grammar. We describe our initial efforts to create a multi-level annotated corpus of Hindi-English codemixed(More)
For many languages that use non-Roman based indigenous scripts (e.g., Arabic, Greek and Indic languages) one can often find a large amount of user generated transliterated content on the Web in the Roman script. Such content creates a monolingual or multi-lingual space with more than one script which we refer to as the Mixed-Script space. IR in the(More)
Alternative paths to linguistic annotation, such as those utilizing games or exploiting the web users, are becoming popular in recent times owing to their very high benefit-to-cost ratios. In this paper, however, we report a case study on POS annotation for Bangla and Hindi, where we observe that reliable linguistic annotation requires not only expert(More)
This paper reports preliminary results of data-driven modeling of segmental (phoneme) duration for Hindi. Classification and Regression Tree (CART) based datadriven duration modeling for segmental duration prediction is presented. A number of features are considered and their usefulness and relative contribution for segmental duration prediction is(More)
Designing ICT systems for rural users in the developing world is difficult for a variety of reasons ranging from problems with infrastructure to wide differences in user contexts and capabilities. Developing regions may include huge variability in spoken languages, and users are often low- or non-literate, with very little experience interacting with(More)
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good quality Hindi Text-to-Speech (TTS) system. The Festival framework is chosen for developing the Hindi TTS system. Since Festival does not provide complete language processing support specific to various languages, it needs to be augmented to facilitate the(More)
This paper introduces our efforts to create UPX, an XML-based successor to the venerable UNIPEN format for the representation of annotated datasets of online handwriting data. In the first part of the paper, shortcomings of the UNIPEN format are discussed and the goals of UPX are outlined. Prior work related to UPX in the form of the recently proposed(More)
We present a universal Parts-of-Speech (POS) tagset framework covering most of the Indian languages (ILs) following the hierarchical and decomposable tagset schema. In spite of significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are(More)
Back-transliteration based Input Method Editors are very popular for Indian Languages. In this paper we evaluate two such Indic language systems to help understand the challenge of designing a back-transliteration based IME. Through a detailed error-analysis of Hindi, Bangla and Telugu data, we study the role of phonological features of Indian scripts that(More)
This paper describes a method to mine Hindi-English transliteration pairs from online Hindi song lyrics. The technique is based on the observations that lyrics are transliterated word-by-word, maintaining the precise word order. The mining task is nevertheless challenging because the Hindi lyrics and its transliterations are usually available from(More)