
Development of Word Embeddings for Uzbek Language

@article{Mansurov2020DevelopmentOW,
  title={Development of Word Embeddings for Uzbek Language},
  author={B. Mansurov and A. Mansurov},
  journal={ArXiv},
  year={2020},
  volume={abs/2009.14384}
}
In this paper, we share the process of developing word embeddings for the Cyrillic variant of the Uzbek language. The result of our work is the first publicly available set of word vectors trained with the word2vec, GloVe, and fastText algorithms on a high-quality web-crawl corpus developed in-house. The developed word embeddings can be used in many downstream natural language processing tasks.
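A minimal sketch of how such embeddings can be trained with gensim, assuming a plain-text corpus with one tokenized sentence per line; the file names and hyperparameters below are illustrative assumptions, not the authors' exact settings, and GloVe vectors are typically trained separately with the Stanford GloVe toolkit rather than gensim.

```python
# Sketch: train word2vec and fastText vectors on a plain-text corpus with gensim.
# "uz_cyrillic_corpus.txt" and the hyperparameters are illustrative assumptions.
from gensim.models import Word2Vec, FastText
from gensim.models.word2vec import LineSentence

sentences = LineSentence("uz_cyrillic_corpus.txt")  # one tokenized sentence per line

w2v = Word2Vec(sentences, vector_size=300, window=5, sg=1, hs=1, min_count=5, workers=4)
ft = FastText(sentences, vector_size=300, window=5, sg=1, min_count=5, workers=4)

w2v.wv.save_word2vec_format("uz_word2vec_300d.vec")
ft.wv.save("uz_fasttext_300d.model")
# GloVe is usually trained with the Stanford toolkit, which first builds a
# word co-occurrence matrix and then fits the log-bilinear model to it.
```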

UzBERT: pretraining a BERT model for Uzbek

Introduces UzBERT, a pretrained Uzbek language model based on the BERT architecture that greatly outperforms multilingual BERT on masked-language-model accuracy; the model is made publicly available under the MIT open-source license.
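A quick sketch of masked-language-model inference with such a model via Hugging Face transformers; the checkpoint identifier, the [MASK] token convention, and the example sentence below are assumptions and should be checked against the actual published UzBERT release.

```python
# Sketch: masked-token prediction with a pretrained BERT-style Uzbek model.
# The checkpoint name is an assumption; replace it with the published UzBERT id.
from transformers import pipeline

fill = pipeline("fill-mask", model="coppercitylabs/uzbert-base-uncased")
# Illustrative Cyrillic Uzbek sentence: "Tashkent is Uzbekistan's [MASK] city."
print(fill("Тошкент Ўзбекистоннинг [MASK] шаҳри."))
```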

Query Expansion based on Word Embeddings and Ontologies for Efficient Information Retrieval

Proposes a novel two-level query expansion algorithm that combines web ontologies and word embeddings for similarity calculation, reporting a significant improvement of 93% over the initial user query.
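A minimal sketch of the embedding side of query expansion (the ontology level of the two-level algorithm is omitted), assuming pretrained vectors stored in word2vec text format; the file name and helper function are illustrative.

```python
# Sketch: expand a query with nearest neighbours from pretrained word vectors.
# The vector file name and the helper are illustrative assumptions.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("uz_word2vec_300d.vec")

def expand_query(terms, topn=3):
    """Append the top-n nearest neighbours of each in-vocabulary query term."""
    expanded = list(terms)
    for term in terms:
        if term in vectors.key_to_index:
            expanded += [w for w, _ in vectors.most_similar(term, topn=topn)]
    return expanded
```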

Automatic Exam Correction Framework (AECF) for the MCQs, Essays, and Equations Matching

Proposes an automatic exam correction framework (HMB-AECF) for MCQs, essays, and equations, abstracted into five layers; the proposed equation similarity checker algorithm is reported to achieve 100% accuracy over the SymPy Python package.
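For illustration only (this is not the HMB-AECF algorithm itself), SymPy can test whether two submitted expressions are algebraically equivalent, which is the kind of check an equation-matching component relies on:

```python
# Sketch: checking algebraic equivalence of a student answer against a reference.
import sympy as sp

x = sp.symbols("x")
reference = sp.sympify("(x + 1)**2")
student = sp.sympify("x**2 + 2*x + 1")

# simplify() reduces the difference to zero when the expressions are equivalent
equivalent = sp.simplify(reference - student) == 0
print(equivalent)  # True
```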

References

Learning Word Vectors for 157 Languages

This paper describes how high-quality word representations for 157 languages were trained on the free online encyclopedia Wikipedia and data from the Common Crawl project, and introduces three new word analogy datasets to evaluate these word vectors.
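A short sketch of loading the Uzbek vectors from this release with gensim; the cc.uz.300.vec.gz file name follows fastText's published naming convention, while the local path and the example word are assumptions.

```python
# Sketch: load the published 157-language fastText vectors for Uzbek.
# File assumed to be downloaded locally as cc.uz.300.vec.gz.
from gensim.models import KeyedVectors

uz = KeyedVectors.load_word2vec_format("cc.uz.300.vec.gz", binary=False)
print(uz.most_similar("kitob", topn=5))  # neighbours of "book" (Latin-script Uzbek)
```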

When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?

It is shown that pre-trained word embeddings can be surprisingly effective in NMT tasks – providing gains of up to 20 BLEU points in the most favorable setting.

Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
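For reference, the skip-gram variant of these architectures maximizes the average log-probability of the context words within a window of size c around each target word:

```latex
% Skip-gram training objective over a corpus of T words.
J = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t)
```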

Word Embedding based Generalized Language Model for Information Retrieval

A generalized language model is constructed in which the mutual independence between a pair of words (say t and t') no longer holds, and the vector embeddings of the words are used to derive the transformation probabilities between words.
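One plausible way to instantiate such transformation probabilities from embeddings, shown here as an illustrative assumption rather than the paper's exact formulation, is to normalize a vector-similarity function over the vocabulary:

```latex
% Illustrative transformation probability from word t' to word t, obtained by
% normalizing embedding similarity over the vocabulary V (an assumption, not
% necessarily the paper's exact definition).
P(t \mid t') = \frac{\mathrm{sim}(\vec{t}, \vec{t'})}{\sum_{u \in V} \mathrm{sim}(\vec{u}, \vec{t'})}
```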

Enriching Word Vectors with Subword Information

A new approach based on the skipgram model in which each word is represented as a bag of character n-grams and a word's vector is the sum of these n-gram representations; it achieves state-of-the-art performance on word similarity and analogy tasks.
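A small sketch of the subword decomposition: boundary markers are added and character n-grams of length 3 to 6 (the fastText defaults) are extracted, with the full word kept as one additional unit; the word's vector is then the sum of the vectors of these units.

```python
# Sketch: decompose a word into the character n-grams used by the subword model.
def char_ngrams(word, n_min=3, n_max=6):
    padded = f"<{word}>"          # boundary markers distinguish prefixes/suffixes
    grams = [padded]              # the whole word is kept as one extra unit
    for n in range(n_min, n_max + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

print(char_ngrams("where"))  # ['<where>', '<wh', 'whe', 'her', 'ere', 're>', ...]
```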

GloVe: Global Vectors for Word Representation

A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
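For reference, GloVe fits this log-bilinear model by minimizing a weighted least-squares objective over the word co-occurrence matrix X:

```latex
% GloVe objective: word vectors w_i, context vectors \tilde{w}_j, biases b_i,
% \tilde{b}_j, and a weighting function f that downweights rare co-occurrences.
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```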

Leveraging word embeddings for spoken document summarization

This paper focuses on building novel and efficient ranking models based on general word embedding methods for extractive speech summarization, and demonstrates the effectiveness of the proposed methods compared to existing state-of-the-art methods.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
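A brief sketch of the "one additional output layer" idea using Hugging Face transformers: a classification head is attached to a pretrained BERT checkpoint and then fine-tuned on labeled data; the checkpoint name, example sentence, and label count below are illustrative.

```python
# Sketch: attach a sequence-classification head to a pretrained BERT checkpoint.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

inputs = tokenizer("Bu juda yaxshi kitob.", return_tensors="pt")
logits = model(**inputs).logits  # head is untrained; fine-tune before use
```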

300d word2vec (skipgram, hierarchical softmax) embeddings for Cyrillic Uzbek using webcrawl v1 corpus