Code-mixing is frequently observed in user-generated content on social media, especially from multilingual users. The linguistic complexity of such content is compounded by the presence of spelling variations, transliteration and non-adherence to formal grammar. We describe our initial efforts to create a multi-level annotated corpus of Hindi-English ...
Alternative paths to linguistic annotation, such as those utilizing games or exploiting web users, have recently become popular owing to their very high benefit-to-cost ratios. In this paper, however, we report a case study on POS annotation for Bangla and Hindi, where we observe that reliable linguistic annotation requires not only expert ...
Voice user interfaces for ICTD applications have immense potential in their ability to reach a large illiterate or semi-literate population in these regions, where text-based interfaces are of little use. However, building speech systems for a new language is a highly resource-intensive task. There have been attempts in the past to develop techniques to ...
For many languages that use non-Roman-based indigenous scripts (e.g., Arabic, Greek and Indic languages), one can often find a large amount of user-generated transliterated content on the Web in the Roman script. Such content creates a monolingual or multilingual space with more than one script, which we refer to as the Mixed-Script space. IR in the ...
This paper describes a method to mine Hindi-English transliteration pairs from online Hindi song lyrics. The technique is based on the observation that lyrics are transliterated word-by-word, maintaining the precise word order. The mining task is nevertheless challenging because the Hindi lyrics and their transliterations are usually available from ...
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good-quality Hindi Text-to-Speech (TTS) system. The Festival framework is chosen for developing the Hindi TTS system. Since Festival does not provide complete language processing support specific to various languages, it needs to be augmented to facilitate the ...
Machine transliteration is used in a number of NLP applications ranging from machine translation and information retrieval to input mechanisms for non-Roman scripts. Many popular Input Method Editors for Indian languages, such as Baraha, Akshara and Quillpad, use back-transliteration as a mechanism to allow users to input text in a number of Indian languages. ...
We present a universal Part-of-Speech (POS) tagset framework covering most of the Indian languages (ILs) following the hierarchical and decomposable tagset schema. Despite the significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are ...
This paper introduces our efforts to create UPX, an XML-based successor to the venerable UNIPEN format for the representation of annotated datasets of online handwriting data. In the first part of the paper, shortcomings of the UNIPEN format are discussed and the goals of UPX are outlined. Prior work related to UPX in the form of the recently proposed ...
Designing ICT systems for rural users in the developing world is difficult for a variety of reasons ranging from problems with infrastructure to wide differences in user contexts and capabilities. Developing regions may include huge variability in spoken languages, and users are often low- or non-literate, with very little experience interacting with ...