cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information

Shaosheng Cao, Wei Lu, Jun Zhou, Xiaolong Li
We propose cw2vec, a novel method for learning Chinese word embeddings. Empirical results on the word similarity, word analogy, text classification and named entity recognition tasks show that the proposed approach consistently outperforms state-of-the-art approaches such as word-based word2vec and GloVe, character-based CWE, component-based JWE and pixel-based GWE.


An Adaptive Wordpiece Language Model for Learning Chinese Word Embeddings

A novel approach called BPE+ adaptively generates variable-length grams, which overcomes the limitation of stroke n-grams; empirical results verify that this method significantly outperforms several state-of-the-art methods.

Radical and Stroke-Enhanced Chinese Word Embeddings Based on Neural Networks

The proposed Radical and Stroke-enhanced Word Embeddings (RSWE), a novel method based on neural networks for learning Chinese word embeddings with joint guidance from semantic and morphological internal information, outperforms existing state-of-the-art approaches.

VCWE: Visual Character-Enhanced Word Embeddings

A model that learns Chinese word embeddings via three-level composition: a convolutional neural network extracts the intra-character compositionality from the visual shape of a character; a recurrent neural network with self-attention composes character representations into word embeddings; and the Skip-Gram framework captures non-compositionality directly from the contextual information.

Learning Chinese Word Embeddings from Stroke, Structure and Pinyin of Characters

A novel method, ssp2vec, is proposed to predict the contextual words based on the feature substrings of the target words for learning Chinese word embeddings, and it is shown that the proposed method obtains better results than state-of-the-art approaches.

Learning Chinese word representation better by cascade morphological n-gram

By overlaying component and stroke n-gram vectors on word vectors, this paper improves Chinese word embeddings so as to preserve as much morphological information as possible at different granularity levels.

Joint Fine-Grained Components Continuously Enhance Chinese Word Embeddings

This work proposes a continuously enhanced word embedding model that starts with fine-grained strokes and adjacent stroke information and enhances subcharacter embedding by combining the relationship vector representation between strokes.

Hierarchical Joint Learning for Chinese Word Embeddings

This work proposes a method called HJWE, which predicts the target word together with the characters and sub-characters in the target word at the same time, and shows that this method performs best on the word similarity, word analogy and text classification tasks.

Attention Enhanced Chinese Word Embeddings

A new Chinese word embedding method called AWE is introduced, which uses an attention mechanism to enhance Mikolov’s CBOW; the work also proposes P&AWE. These models far exceed the CBOW model and achieve state-of-the-art performance on the word similarity task.

A survey of word embeddings based on deep learning

Recent advances in neural network-based word embeddings and their technical features are introduced, the key challenges and existing solutions are summarized, and a future outlook on research and applications is given.



Improving Word Embeddings with Convolutional Feature Learning and Subword Information

A convolutional neural network architecture is introduced that allows us to measure structural information of context words and incorporate subword features conveying semantic, syntactic and morphological information related to the words.

Joint Learning of Character and Word Embeddings

A character-enhanced word embedding model (CWE) is presented to address the issues of character ambiguity and non-compositional words, and the effectiveness of CWE on word relatedness computation and analogical reasoning is evaluated.

Improve Chinese Word Embeddings by Exploiting Internal Structure

This paper proposes a similarity-based method to learn Chinese word and character embeddings jointly by exploiting the similarity between a word and its component characters with the semantic knowledge obtained from other languages.

Improved Learning of Chinese Word Embeddings with Semantic Knowledge

The basic idea is to take the semantic knowledge about words and their component characters into account when designing composition functions, and experiments show that this approach outperforms two strong baselines on word similarity, word analogy, and document classification tasks.

Charagram: Embedding Words and Sentences via Character n-grams

It is demonstrated that Charagram embeddings outperform more complex architectures based on character-level recurrent and convolutional neural networks, achieving new state-of-the-art performance on several similarity tasks.

Enriching Word Vectors with Subword Information

A new approach based on the skip-gram model is proposed, in which each word is represented as a bag of character n-grams and the word vector is the sum of the n-gram representations; it achieves state-of-the-art performance on word similarity and analogy tasks.
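
The bag-of-n-grams representation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `<` and `>` boundary markers, the n-gram length range of 3 to 6, the tiny vector dimension, and the helper names `char_ngrams` and `word_vector` are all assumptions chosen for illustration, and the n-gram vectors are random rather than trained.

```python
import random


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word` for n_min <= n <= n_max.

    The word is padded with '<' and '>' boundary markers (an assumption
    of this sketch) so that prefixes and suffixes yield distinct n-grams.
    """
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def word_vector(word, table, rng, dim=4):
    """Sum the vectors of the word's n-grams.

    `table` maps n-gram -> vector. N-grams not yet in the table receive a
    random vector, standing in for parameters a real model would learn.
    """
    total = [0.0] * dim
    for gram in char_ngrams(word):
        if gram not in table:
            table[gram] = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
        total = [t + g for t, g in zip(total, table[gram])]
    return total


if __name__ == "__main__":
    print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
    rng = random.Random(0)
    table = {}
    print(word_vector("where", table, rng))
```

Because the word vector is built from shared n-gram vectors, two words with overlapping substrings share parameters, which is what lets such a model produce a vector for a word it never saw during training.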

Word Representations: A Simple and General Method for Semi-Supervised Learning

This work evaluates Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking, and finds that each of the three word representations improves the accuracy of these baselines.

Learning word embeddings efficiently with noise-contrastive estimation

This work proposes a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation, and achieves results comparable to the best ones reported, using four times less data and more than an order of magnitude less computing time.

Learning Chinese Word Representations From Glyphs Of Characters

The character glyph features are directly learned from the bitmaps of characters by a convolutional auto-encoder (convAE), and the glyph features improve Chinese word representations which are already enhanced by character embeddings.

Knowledge-Powered Deep Learning for Word Embedding

This study explores the capacity of leveraging morphological, syntactic, and semantic knowledge to achieve high-quality word embeddings, using these types of knowledge to define a new basis for word representation, provide additional input information, and serve as auxiliary supervision in deep learning.