Investigation and modeling of the structure of texting language
- M. Choudhury, R. Saraf, V. Jain, Animesh Mukherjee, S. Sarkar, A. Basu
- Computer Science
- International Journal of Document Analysis and…
- 1 December 2007
The nature and type of compressions used in SMS texts are investigated, and a Hidden Markov Model based word model for texting language (TL) is developed, which results in a 35% reduction of the relative word-level error rates.
The State and Fate of Linguistic Diversity and Inclusion in the NLP World
This work examines the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time, and underlines the disparity between languages.
POS Tagging of English-Hindi Code-Mixed Social Media Content
- Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, M. Choudhury
- Computer Science
- EMNLP
- 1 October 2014
The initial efforts to create a multi-level annotated corpus of Hindi-English code-mixed text collated from Facebook forums are described, and language identification, back-transliteration, normalization and POS tagging of this data are explored.
"I am borrowing ya mixing ?" An Analysis of English-Hindi Code Mixing in Facebook
The classification of code-mixed words based on frequency and linguistic typology underlines the fact that while there are easily identifiable cases of borrowing and mixing at the two ends, a large majority of the words form a continuum in the middle, emphasizing the need to handle these at different levels for automatic processing of the data.
GLUECoS: An Evaluation Benchmark for Code-Switched NLP
- Simran Khanuja, Sandipan Dandapat, A. Srinivasan, Sunayana Sitaram, M. Choudhury
- Computer Science, Linguistics
- ACL
- 26 April 2020
This work presents an evaluation benchmark, GLUECoS, for code-switched languages that spans several NLP tasks in English-Hindi and English-Spanish, and shows that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, showing that multilingual models can be further optimized for code-switching tasks.
Query expansion for mixed-script information retrieval
This paper formally introduces the concept of Mixed-Script IR, estimates the prevalence of this problem through analysis of the query logs of the Bing search engine, and gives a principled solution to handle mixed-script term matching and spelling variation.
Overview of the FIRE 2013 Track on Transliterated Search
- Rishiraj Saha Roy, M. Choudhury, Prasenjit Majumder, Komal Agarwal
- Computer Science
- FIRE
- 4 December 2013
An overview of the FIRE 2013 track on transliterated search is provided, noting that there is considerable scope for improvement of transliteration accuracies for the studied languages.
Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data
- Adithya Pratapa, G. Bhat, M. Choudhury, Sunayana Sitaram, Sandipan Dandapat, Kalika Bali
- Computer Science
- ACL
- 22 May 2018
A computational technique for the creation of grammatically valid artificial code-mixed (CM) data based on the Equivalence Constraint Theory is presented, and it is shown that when training examples are sampled appropriately from this synthetic data and presented in a certain order, they can significantly reduce the perplexity of an RNN-based language model.
Word Embeddings for Code-Mixed Language Processing
This study demonstrates that existing bilingual embedding techniques are not ideal for code-mixed text processing and that there is a need for learning multilingual word embeddings from code-mixed text.
Overview of FIRE 2014 Track on Transliterated Search
The Transliterated Search track has been organized for the second year in FIRE. The track has two subtasks. Subtask 1 on language labeling of words in code-mixed text fragments was conducted for 6…