Corpus ID: 232478708

Evaluating Neural Word Embeddings for Sanskrit

@article{Sandhan2021EvaluatingNW,
  title={Evaluating Neural Word Embeddings for Sanskrit},
  author={Jivnesh Sandhan and Om Adideva and Digumarthi Komal and Laxmidhar Behera and Pawan Goyal},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.00270}
}
Recently, the remarkable performance of the supervised learning paradigm has garnered considerable attention from Sanskrit computational linguists. As a result, the Sanskrit community has made laudable efforts to build task-specific labelled data for various downstream Natural Language Processing (NLP) tasks. The primary component of these approaches is the word embedding representation. Word embeddings help to transfer knowledge learned from readily available unlabelled data for…
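
As a quick illustration of the transfer idea the abstract describes, here is a minimal sketch that trains static word embeddings on unlabelled text with gensim's Word2Vec. The toy IAST tokens and every hyperparameter value are assumptions for illustration, not the paper's actual corpus or setup.

```python
# Minimal sketch: learn word vectors from unlabelled text, then reuse them
# downstream. Corpus and hyperparameters are illustrative assumptions.
from gensim.models import Word2Vec

# Unlabelled corpus: tokenised sentences (placeholder IAST tokens).
corpus = [
    ["rāmaḥ", "vanam", "gacchati"],
    ["sītā", "vanam", "gacchati"],
    ["rāmaḥ", "sītā", "ca", "vadataḥ"],
]

# Train skip-gram embeddings (sg=1); vector_size/window/min_count are guesses.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

# The learned vectors can then be transferred to downstream labelled tasks.
vec = model.wv["vanam"]                       # 100-dimensional numpy array
print(model.wv.most_similar("rāmaḥ", topn=2))
```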

Citations

Systematic Investigation of Strategies Tailored for Low-Resource Settings for Sanskrit Dependency Parsing

This work investigates how far a purely data-driven approach using recently proposed strategies for low-resource settings can be pushed, experimenting with five strategies: data augmentation, sequential transfer learning, cross-lingual/monolingual pretraining, multi-task learning, and self-training.

Embeddings models for Buddhist Sanskrit

It is shown that for contextual models the optimal layer combination for embedding construction is task-dependent, and that pretraining contextual embedding models on a reference corpus of general Sanskrit is beneficial, a promising finding for the future development of embeddings for less-resourced languages and domains.
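
The layer-combination finding can be made concrete with a short sketch: given per-layer hidden states from a contextual model, token embeddings are built by pooling a task-specific subset of layers. The random tensors below stand in for a real model's activations, and the particular layer subsets are assumptions for illustration.

```python
import torch

# 13 "layers": input embeddings + 12 transformer layers, as in BERT-base.
num_layers, seq_len, dim = 13, 8, 768
hidden_states = torch.randn(num_layers, seq_len, dim)  # placeholder activations

def combine_layers(states, layers):
    """Average the selected layers into (seq_len, dim) token embeddings."""
    return states[layers].mean(dim=0)

# Different tasks may favour different layer subsets (the paper's finding):
semantic_emb = combine_layers(hidden_states, [9, 10, 11, 12])  # upper layers
syntactic_emb = combine_layers(hidden_states, [2, 3, 4])       # lower layers
print(semantic_emb.shape)  # torch.Size([8, 768])
```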

A Novel Multi-Task Learning Approach for Context-Sensitive Compound Type Identification in Sanskrit

This work proposes a novel multi-task learning architecture that incorporates contextual information and enriches complementary syntactic information, using morphological tagging and dependency parsing as two auxiliary tasks for Sanskrit compound type identification.

Creation of a Digital Rig Vedic Index (Anukramani) for Computational Linguistic Tasks

An index of Rig Vedic verses along with the respective devatā, ṛṣi, and chandas is presented, which is, in short, a digitized form of the well-known Rigvedic Anukramaṇī.

Filtering and Extended Vocabulary based Translation for Low-resource Language pair of Sanskrit-Hindi

An in-depth analysis addressing the challenges of translating the low-resource Sanskrit-Hindi language pair, using novel training-corpus filtering with an extended vocabulary in a zero-shot transformer architecture.

References

Showing 1-10 of 51 references

Embeddings in Natural Language Processing: Theory and Advances in Vector Representations of Meaning

This book, written by Mohammad Taher Pilehvar and Jose Camacho-Collados, provides a comprehensive and easy-to-read review of the theory and advances in vector models for NLP, focusing especially on semantic representations and their applications.

Poetry to Prose Conversion in Sanskrit as a Linearisation Task: A Case for Low-Resource Languages

Kāvya guru outperforms current state-of-the-art models on the poetry-to-prose linearisation task in Sanskrit, evaluated against the original word order in the verse.

A Survey of Word Embeddings Evaluation Methods

An extensive overview of the field of word embedding evaluation is presented, highlighting the main problems, proposing a typology of evaluation approaches, and summarizing 16 intrinsic and 12 extrinsic methods.
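
As a concrete example of one intrinsic method covered by such surveys, the sketch below scores embeddings on a word-similarity dataset via the Spearman correlation between human judgements and cosine similarities. The word pairs, gold scores, and random stand-in vectors are assumptions for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# (word1, word2, human similarity score) triples from a hypothetical dataset.
gold = [("deva", "sura", 9.0), ("deva", "vana", 2.5), ("nadī", "jala", 7.0)]

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for pair in gold for w in pair[:2]}  # stand-in embeddings

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in gold]
human_scores = [s for _, _, s in gold]

rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman rho: {rho:.3f}")  # higher means better agreement with humans
```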

Exploring the Limits of Language Modeling

This work explores recent advances in Recurrent Neural Networks for large-scale language modeling and extends current models to deal with two key challenges present in this task: corpora and vocabulary sizes, and the complex, long-term structure of language.

Improving Distributional Similarity with Lessons Learned from Word Embeddings

It is revealed that much of the performance gain of word embeddings is due to certain system design choices and hyperparameter optimizations rather than the embedding algorithms themselves, and that these modifications can be transferred to traditional distributional models, yielding similar gains.
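
One such transferable design choice is context distribution smoothing, borrowed from word2vec's negative sampling, in which context counts are raised to the power 0.75 before computing PMI. The sketch below applies it to a toy PPMI matrix; the co-occurrence counts are assumptions for illustration.

```python
import numpy as np

counts = np.array([[10., 2., 0.],        # rows: target words
                   [ 3., 8., 1.],        # cols: context words
                   [ 0., 1., 6.]])

alpha = 0.75                              # smoothing exponent from word2vec
total = counts.sum()
p_w = counts.sum(axis=1) / total          # P(word)
ctx = counts.sum(axis=0) ** alpha
p_c = ctx / ctx.sum()                     # smoothed P(context)
p_wc = counts / total                     # joint P(word, context)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / np.outer(p_w, p_c))
ppmi = np.maximum(pmi, 0)                 # positive PMI matrix
print(ppmi.round(2))
```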

Enriching Word Vectors with Subword Information

A new approach based on the skip-gram model is presented in which each word is represented as a bag of character n-grams and a word's vector is the sum of its n-gram representations; it achieves state-of-the-art performance on word similarity and analogy tasks.
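
The subword mechanism is simple enough to sketch directly: a word's vector is the sum of vectors for its character n-grams, so morphologically related forms share parameters. The hashing scheme, table size, and dimensions below are simplified assumptions (real fastText uses an FNV hash and 2M buckets).

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"                       # fastText wraps words in boundary markers
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

dim, buckets = 100, 100_000               # shrunk for the sketch; fastText uses 2M buckets
ngram_table = np.random.default_rng(0).normal(size=(buckets, dim))

def word_vector(word):
    grams = char_ngrams(word)
    rows = [hash(g) % buckets for g in grams]   # stand-in for fastText's FNV hash
    return ngram_table[rows].sum(axis=0)        # word = sum of its n-gram vectors

# Inflected forms overlap in n-grams, hence in representation:
print(char_ngrams("gacchati")[:5])
print(word_vector("gacchati").shape)            # (100,)
```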

Building a Word Segmenter for Sanskrit Overnight

This work proposes an approach using a deep sequence-to-sequence (seq2seq) model that takes only the sandhied string as input and predicts the unsandhied string, performing better than the current state of the art.
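
To make the framing concrete, here is a toy character-level encoder-decoder in PyTorch that maps a sandhied character sequence to an unsandhied one. It is a structural sketch only; the GRU architecture, sizes, and random inputs are assumptions, not the paper's actual model or training setup.

```python
import torch
import torch.nn as nn

class Seq2SeqSegmenter(nn.Module):
    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.embed(src_ids))   # encode sandhied chars
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)                       # per-step char logits

vocab_size = 64                                        # placeholder char vocabulary
model = Seq2SeqSegmenter(vocab_size)
src = torch.randint(0, vocab_size, (2, 12))            # batch of sandhied strings
tgt = torch.randint(0, vocab_size, (2, 15))            # unsandhied targets (teacher forcing)
logits = model(src, tgt)                               # (2, 15, vocab_size)
```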

Revisiting the Role of Feature Engineering for Compound Type Identification in Sanskrit

An automated approach for semantic class identification of compounds in Sanskrit is presented, and the best system, an LSTM architecture with FastText embeddings trained end-to-end, shows promising results.

Deep Contextualized Word Representations

A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.

Universal Language Model Fine-tuning for Text Classification

This work proposes Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduces techniques that are key for fine-tuning a language model.
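
One of those techniques, discriminative fine-tuning, is easy to sketch: each layer group of the pretrained model gets its own learning rate, decayed by a factor of 2.6 per group moving toward the input, as in the paper. The tiny stand-in model and base learning rate below are assumptions for illustration.

```python
import torch
import torch.nn as nn

lm = nn.Sequential(                      # stand-in for a pretrained AWD-LSTM
    nn.Embedding(1000, 64),              # group 0: closest to the input
    nn.LSTM(64, 64, batch_first=True),   # group 1
    nn.Linear(64, 2),                    # group 2: task-specific classifier head
)

base_lr, decay = 1e-3, 2.6
groups = [
    {"params": layer.parameters(), "lr": base_lr / decay ** (len(lm) - 1 - i)}
    for i, layer in enumerate(lm)
]
optimizer = torch.optim.Adam(groups)     # lower layers get smaller learning rates
for g in optimizer.param_groups:
    print(f"lr = {g['lr']:.2e}")
```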
...