Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
- Gowtham Ramesh, Sumanth Doddapaneni, Mitesh Khapra
- International Conference on Topology, Algebra and…
- 12 April 2021
Samanantar, the largest publicly available parallel corpora collection for Indic languages, is presented; multilingual NMT models spanning all of these languages are trained on it and outperform existing models and baselines on publicly available benchmarks such as FLORES, establishing the utility of Samanantar.
A Primer on Pretrained Multilingual Language Models
- Sumanth Doddapaneni, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
- arXiv.org
- 1 July 2021
The existing literature on multilingual language models is reviewed across several broad areas of research, and some promising directions for future research are recommended.
Bitions@DravidianLangTech-EACL2021: Ensemble of Multilingual Language Models with Pseudo Labeling for Offence Detection in Dravidian Languages
- Debapriya Tula, Prathyush Potluri, P. Patwa
- DRAVIDIANLANGTECH
- 2021
A multilingual ensemble-based model is proposed that can identify offensive content targeted against an individual (or group) in low-resource Dravidian languages, handling code-mixed data as well as instances where the script itself is mixed.
Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages
- Kaushal Bhogale, A. Raman, Mitesh M. Khapra
- IEEE International Conference on Acoustics…
- 26 August 2022
This work creates Shrutilipi, a dataset containing over 6,400 hours of labelled audio across 12 Indian languages, totalling 4.95M sentences. It adapts the Needleman-Wunsch alignment algorithm (sketched generically below) to align sentences with their corresponding audio segments given a long audio recording and a PDF of its transcript, while remaining robust to errors due to OCR, extraneous text, and non-transcribed speech.
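The Needleman-Wunsch algorithm named above is the classic dynamic-programming global aligner. A minimal generic sketch follows, assuming illustrative match/mismatch/gap scores; it is not the paper's audio-text adaptation, which modifies the algorithm to tolerate OCR errors and non-transcribed speech.

```python
# Minimal sketch of classic Needleman-Wunsch global alignment.
# Scoring parameters are illustrative, not those used for Shrutilipi.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the DP score matrix for globally aligning sequences a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # a prefix of a aligned to nothing
        dp[i][0] = i * gap
    for j in range(1, m + 1):          # a prefix of b aligned to nothing
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + sub,  # match / substitute
                dp[i - 1][j] + gap,      # gap in b
                dp[i][j - 1] + gap,      # gap in a
            )
    return dp

# The bottom-right cell holds the optimal global alignment score;
# a traceback over dp would recover the alignment itself.
print(needleman_wunsch("kitten", "sitting")[-1][-1])
```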
Towards Building ASR Systems for the Next Billion Users
- Tahir Javed, Sumanth Doddapaneni, Mitesh M. Khapra
- AAAI Conference on Artificial Intelligence
- 6 November 2021
This work establishes that multilingual pretraining is an effective strategy for building ASR systems for the linguistically diverse speakers of the Indian subcontinent.
A Survey of Adversarial Defences and Robustness in NLP
- Shreyansh Goyal, Sumanth Doddapaneni, Mitesh M. Khapra, B. Ravindran
- ACM Computing Surveys
- 12 March 2022
In the past few years, it has become increasingly evident that deep neural networks are not resilient enough to withstand adversarial perturbations in input data, leaving them vulnerable to attack. …
IndicXTREME: A Multi-Task Benchmark For Evaluating Indic Languages
- Sumanth Doddapaneni, Rahul Aralikatte, Pratyush Kumar
- arXiv.org
- 2022
IndicXTREME, a benchmark of nine diverse tasks covering 18 languages of the Indian subcontinent from four different language families, is introduced; it is the first effort toward a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models.
Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages
- Arnav Mhaske, Harsh Kedia, Anoop Kunchukuttan
- arXiv.org
- 20 December 2022
The largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families is presented, and a multilingual IndicBERT model fine-tuned on the Naamapadam training set is released.
Offence Detection in Dravidian Languages Using Code-Mixing Index-Based Focal Loss
- Debapriya Tula, M. Shreyas, P. Patwa
- SN Computer Science
- 12 November 2021
A novel code-mixing index (CMI) based focal loss is introduced that addresses two challenges in Dravidian-language offence detection, (1) code-mixing within languages and (2) class imbalance, and can handle offensive-language detection in a low-resource, class-imbalanced, multilingual, and code-mixed setting (see the sketch below).
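The focal loss this entry builds on is the standard formulation of Lin et al., which down-weights easy examples by a (1 - p)^gamma factor. A minimal sketch of that base loss follows, with a per-example weight standing in for the paper's CMI term; the exact CMI weighting is not given here and the factor is purely hypothetical.

```python
import math

def focal_loss(p_true, gamma=2.0, cmi_weight=1.0):
    """Standard focal loss for one example: -(1 - p)^gamma * log(p),
    where p is the predicted probability of the true class.
    `cmi_weight` is a hypothetical per-example factor standing in for
    the paper's code-mixing-index term (illustrative only)."""
    return -cmi_weight * (1.0 - p_true) ** gamma * math.log(p_true)

# An easy, confidently correct example contributes almost nothing...
print(focal_loss(0.9))  # ~0.001
# ...while a hard example dominates the loss, countering class imbalance.
print(focal_loss(0.1))  # ~1.87
```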
Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages
- Rahul Aralikatte, Ziling Cheng, Sumanth Doddapaneni, J. Cheung
- arXiv.org
- 10 May 2023
It is shown that the dataset is challenging even for state-of-the-art abstractive models, which perform only slightly better than extractive baselines, and that it can be used to pretrain strong language models that outperform competitive baselines on both NLU and NLG benchmarks.
...