Enriching Word Vectors with Subword Information

@article{Bojanowski2017EnrichingWV,
  title={Enriching Word Vectors with Subword Information},
  author={Piotr Bojanowski and Edouard Grave and Armand Joulin and Tomas Mikolov},
  journal={Transactions of the Association for Computational Linguistics},
  year={2017},
  volume={5},
  pages={135-146}
}
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. [...] Key Method: A vector representation is associated to each character n-gram; words being represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word…
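The key idea above (a word vector is the sum of the vectors of its character n-grams, with n-grams of length 3 to 6 and '<' / '>' marking word boundaries) can be sketched in a few lines of Python. This is a toy illustration of the idea, not the fastText implementation: the random initialisation and the plain dictionary lookup (fastText hashes n-grams into a fixed-size table) are simplifying assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 100
    gram_vectors = {}   # toy stand-in for fastText's hashed n-gram table

    def char_ngrams(word, n_min=3, n_max=6):
        # Add boundary symbols, then enumerate all character n-grams of length 3-6.
        w = f"<{word}>"
        grams = [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]
        return grams + [w]   # the whole word (with boundaries) is also kept as a unit

    def word_vector(word):
        # The word vector is simply the sum of the vectors of its subword units.
        vecs = [gram_vectors.setdefault(g, rng.normal(size=dim)) for g in char_ngrams(word)]
        return np.sum(vecs, axis=0)

    # Out-of-vocabulary words still receive a vector, because they share
    # character n-grams with words seen during training.
    print(word_vector("where").shape)   # (100,)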
Towards Learning Word Representation
Continuous vector representations, as distributed representations for words, have gained a lot of attention in the Natural Language Processing (NLP) field. Although they are considered valuable…
Learning to Generate Word Representations using Subword Information
TLDR
Experimental results show clearly that the proposed model significantly outperforms strong baseline models that regard words or their subwords as atomic units, achieving as much as an 18.5% average improvement in perplexity over strong baselines for morphologically rich languages in the language modeling task.
Learning Word Vectors for 157 Languages
TLDR
This paper describes how high-quality word representations for 157 languages were trained on the free online encyclopedia Wikipedia and data from the Common Crawl project, and introduces three new word analogy datasets to evaluate these word vectors.
Morphological Skip-Gram: Using morphological knowledge to improve word representation
TLDR
A new method for training word embeddings is proposed whose goal is to replace the FastText bag of character n-grams with a bag of word morphemes obtained through morphological analysis of the word.
An Adaptive Wordpiece Language Model for Learning Chinese Word Embeddings
TLDR
A novel approach called BPE+ is established that adaptively generates variable-length grams, breaking the limitation of stroke n-grams; empirical results verify that this method significantly outperforms several state-of-the-art methods.
Measuring Enrichment Of Word Embeddings With Subword And Dictionary Information
TLDR
Results show that fine-tuning the vectors with semantic information dramatically improves performance in word similarity; conversely, enriching word vectors with subword information increases performance in word analogy tasks, with the hybrid approach finding a solid middle ground.
Robust Representation Learning for Low Resource Languages
Understanding the meaning of words is essential for most natural language processing tasks. Word representations are a means to mathematically represent the meaning of a word in a way that computers…
Named Entity Recognition in Russian with Word Representation Learned by a Bidirectional Language Model
TLDR
This paper presents a semi-supervised approach for adding deep contextualized word representations that model both complex characteristics of word usage and how these usages vary across linguistic contexts; the model is evaluated on the FactRuEval-2016 dataset for named entity recognition in Russian and achieves state-of-the-art results.
Morphological Skip-Gram: Replacing FastText characters n-gram with morphological knowledge
TLDR
This work proposes a new method for training word embeddings whose goal is to replace the FastText bag of character n-grams with a bag of word morphemes obtained through morphological analysis of the word.
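As a rough sketch of the substitution the two Morphological Skip-Gram entries describe, the bag of character n-grams in the sum above can be swapped for a bag of morphemes produced by a morphological analyser. The tiny MORPHEMES dictionary below is a made-up stand-in for such an analyser; everything else is an illustrative assumption, not the papers' code.

    import numpy as np

    # Hypothetical output of a morphological analyser (stand-in for real analysis).
    MORPHEMES = {"unhappiness": ["un", "happi", "ness"], "cats": ["cat", "s"]}

    rng = np.random.default_rng(0)
    dim = 100
    subword_vectors = {}

    def word_vector(word):
        # Fall back to the word itself when no analysis is available;
        # the bracketed whole word is kept as an extra unit, as in fastText.
        parts = MORPHEMES.get(word, [word]) + [f"<{word}>"]
        vecs = [subword_vectors.setdefault(p, rng.normal(size=dim)) for p in parts]
        return np.sum(vecs, axis=0)

    print(word_vector("unhappiness").shape)   # (100,)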
Contextualized Word Representations for Multi-Sense Embedding
TLDR
Methods are proposed to generate multiple word representations for each word based on dependency-structure relations; they significantly outperform state-of-the-art methods for multi-sense embeddings and show that the data sparseness problem is resolved by the pre-training.

References

Showing 1-10 of 56 references
Better Word Representations with Recursive Neural Networks for Morphology
TLDR
This paper combines recursive neural networks, where each morpheme is a basic unit, with neural language models to consider contextual information in learning morphologically aware word representations, and proposes a novel model capable of building representations for morphologically complex words from their morphemes.
Co-learning of Word Representations and Morpheme Representations
TLDR
This paper introduces morphological knowledge as both an additional input representation and auxiliary supervision to the neural network framework, producing morpheme representations that can be further employed to infer the representations of rare or unknown words based on their morphological structure.
Learning Character-level Representations for Part-of-Speech Tagging
TLDR
A deep neural network is proposed that learns character-level representations of words and associates them with usual word representations to perform POS tagging, producing state-of-the-art POS taggers for two languages.
Joint Learning of Character and Word Embeddings
TLDR
A character-enhanced word embedding model (CWE) is presented to address the issues of character ambiguity and non-compositional words, and the effectiveness of CWE on word relatedness computation and analogical reasoning is evaluated.
Efficient Estimation of Word Representations in Vector Space
TLDR
Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
KNET: A General Framework for Learning Word Embedding Using Morphological Knowledge
TLDR
This article introduces a novel neural network architecture called KNET that leverages both words’ contextual information and morphological knowledge to learn word embeddings, and demonstrates that the proposed KNET framework can greatly enhance the effectiveness of word embeddings.
Distributed Representations of Words and Phrases and their Compositionality
TLDR
This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
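Negative sampling, mentioned in the entry above, replaces the full softmax with a handful of binary classifications: pull the observed (target, context) pair together and push a few randomly drawn "negative" words apart. The snippet below is a minimal illustration of one such update, not the word2vec code; the vocabulary size, learning rate, number of negatives, and uniform negative sampling (word2vec samples from a smoothed unigram distribution) are simplifying assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    V, dim, k, lr = 1000, 50, 5, 0.05              # vocab size, vector size, negatives, step size
    W_in = rng.normal(scale=0.1, size=(V, dim))    # input (target word) vectors
    W_out = rng.normal(scale=0.1, size=(V, dim))   # output (context word) vectors

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_step(target, context):
        # One stochastic update for a single (target, context) pair.
        negatives = rng.integers(0, V, size=k)     # toy uniform sampling of negatives
        v = W_in[target].copy()
        grad_in = np.zeros(dim)
        for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            g = lr * (label - sigmoid(v @ W_out[c]))   # gradient of the logistic loss
            grad_in += g * W_out[c]
            W_out[c] += g * v
        W_in[target] += grad_in

    sgns_step(target=3, context=17)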
Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation
TLDR
A model for constructing vector representations of words by composing characters using bidirectional LSTMs that requires only a single vector per character type and a fixed set of parameters for the compositional model, which yields state-of-the-art results in language modeling and part-of-speech tagging.
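For contrast with the bag-of-n-grams approach, the compositional character model described above builds a word vector by running a bidirectional LSTM over the word's characters and combining the two final states. The sketch below is a simplified PyTorch variant (it concatenates the two final states rather than mixing them with learned projection matrices as the original model does); the character vocabulary and dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CharToWord(nn.Module):
        def __init__(self, n_chars, char_dim=16, word_dim=64):
            super().__init__()
            self.embed = nn.Embedding(n_chars, char_dim)
            self.lstm = nn.LSTM(char_dim, word_dim // 2,
                                bidirectional=True, batch_first=True)

        def forward(self, char_ids):               # char_ids: (batch, word_length)
            h, _ = self.lstm(self.embed(char_ids))
            half = h.size(-1) // 2
            fwd = h[:, -1, :half]                  # final state of the forward pass
            bwd = h[:, 0, half:]                   # final state of the backward pass
            return torch.cat([fwd, bwd], dim=-1)   # (batch, word_dim)

    chars = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
    ids = torch.tensor([[chars[c] for c in "cats"]])
    print(CharToWord(len(chars))(ids).shape)       # torch.Size([1, 64])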
Word Embeddings Go to Italy: A Comparison of Models and Training Datasets
TLDR
Preliminary results on the generation of word embeddings for the Italian language show that the tested models are able to create syntactically and semantically meaningful word embeddings despite the higher morphological complexity of Italian with respect to English.
Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models
TLDR
A novel word-character solution to achieving open-vocabulary NMT is proposed that can successfully learn not only to generate well-formed words for Czech, a highly inflected language with a very complex vocabulary, but also to build correct representations for English source words.