Automated Detection of Non-Relevant Posts on the Russian Imageboard "2ch": Importance of the Choice of Word Representations

@article{Bakarov2017AutomatedDO,
  title={Automated Detection of Non-Relevant Posts on the Russian Imageboard "2ch": Importance of the Choice of Word Representations},
  author={Amir Bakarov and Olga Gureenkova},
  journal={ArXiv},
  year={2017},
  volume={abs/1707.04860}
}
This study considers the problem of automated detection of non-relevant posts on Web forums and discusses an approach that resolves it by approximating it as the task of detecting semantic relatedness between a given post and the opening post of the forum discussion thread. The approximated task can be resolved by training a supervised classifier on the composed word embeddings of the two posts. Considering that success in this task could be quite sensitive to the…
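As a rough illustration of the approach the abstract describes (not the authors' actual pipeline: the toy post vectors, the concatenation-plus-difference composition, and the logistic-regression classifier below are all assumptions), a supervised classifier over composed embeddings of the opening post and a reply might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: in the paper's setting each post would be represented by
# composed word embeddings (e.g. averaged word vectors); here the post
# vectors are synthetic.
dim, n = 20, 400
opening = rng.normal(size=(n, dim))
labels = rng.integers(0, 2, size=n)            # 1 = relevant, 0 = non-relevant
noise = rng.normal(size=(n, dim))
# Relevant replies resemble their opening post; irrelevant ones do not.
reply = np.where(labels[:, None] == 1, opening + 0.3 * noise, noise)

# Compose the two post representations: concatenation plus an element-wise
# absolute difference, so a linear model can pick up on (dis)similarity.
X = np.hstack([opening, reply, np.abs(opening - reply)])
y = labels.astype(float)

# Plain logistic regression trained with batch gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # predicted P(relevant)
    w -= 0.5 * (X.T @ (p - y)) / n
    b -= 0.5 * (p - y).mean()

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
train_acc = (pred == labels).mean()
```

The absolute-difference features are one common way to let a linear classifier react to how close the two post vectors are; the paper's choice of composition may differ.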

Selective word encoding for effective text representation

TLDR
This paper adapts a trainable orderless aggregation algorithm to obtain a more discriminative abstract representation for text representation and proposes an effective term-weighting scheme that computes the relative importance of words from their context, based on their relation to the task, in an end-to-end learning manner.

A Survey of Word Embeddings Evaluation Methods

TLDR
An extensive overview of the field of word embeddings evaluation is presented, highlighting main problems and proposing a typology of approaches to evaluation, summarizing 16 intrinsic methods and 12 extrinsic methods.

A Precisely Xtreme-Multi Channel Hybrid Approach for Roman Urdu Sentiment Analysis

TLDR
A novel precisely extreme multi-channel hybrid methodology which makes use of convolutional and recurrent neural networks along with pre-trained neural word embeddings is proposed in order to improve the performance of Roman Urdu sentiment analysis.

Anomaly Detection for Short Texts: Identifying Whether Your Chatbot Should Switch from Goal-Oriented Conversation to Chit-Chatting

TLDR
This work compares the performance of 6 different anomaly detection methods on Russian and English short texts modeling conversational utterances, proposing the first evaluation framework for this task, and finds that a simple threshold on cosine similarity works better than the other methods for both of the considered languages.
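A minimal sketch of the thresholding baseline that the TLDR says wins: an utterance is flagged as anomalous when its cosine similarity to a reference in-domain representation drops below a cutoff. The centroid reference, the synthetic vectors, and the 0.5 threshold are all assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)

# Pretend these are sentence vectors of in-domain (goal-oriented) utterances.
in_domain = rng.normal(loc=1.0, scale=0.2, size=(100, 16))
centroid = in_domain.mean(axis=0)

def is_anomalous(vec, threshold=0.5):
    # Low similarity to the in-domain centroid -> likely chit-chat.
    return cosine(vec, centroid) < threshold

goal_utterance = rng.normal(loc=1.0, scale=0.2, size=16)    # resembles the domain
chitchat_utterance = rng.normal(loc=-1.0, scale=0.2, size=16)
```

A chatbot could call `is_anomalous` on each incoming utterance vector and switch to a chit-chat policy when it returns `True`.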

Détection de signaux faibles dans des masses de données faiblement structurées (Detection of weak signals in masses of weakly structured data)

The study presented is part of the development of a platform for automatic document analysis coupled with a secure whistleblowing service of the GlobalLeaks type. This article…

References

SHOWING 1-10 OF 10 REFERENCES

Enriching Word Vectors with Subword Information

TLDR
A new approach based on the skip-gram model in which each word is represented as a bag of character n-grams and a word's vector is the sum of these n-gram representations; it achieves state-of-the-art performance on word similarity and analogy tasks.
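The core idea — a word vector as the sum of its character n-gram vectors — can be sketched as follows. The hashed lookup table is an assumption for illustration (the real model learns the n-gram vectors with the skip-gram objective), but the boundary-marked n-gram extraction and summation mirror the description above.

```python
import zlib

import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Boundary-marked character n-grams of a word, e.g. '<wh' for 'where'."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# Toy n-gram vectors via the hashing trick (an assumption; the real model
# trains these). crc32 is used so the hashing is deterministic across runs.
dim, buckets = 8, 1000
table = np.random.default_rng(2).normal(size=(buckets, dim))

def word_vector(word):
    # The word's representation is the sum of its n-gram vectors.
    return sum(table[zlib.crc32(g.encode()) % buckets]
               for g in char_ngrams(word))
```

Because any string decomposes into n-grams, this scheme also yields vectors for words never seen in training.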

Human and Machine Judgements for Russian Semantic Relatedness

TLDR
This work uses one of the best approaches identified in this competition to generate a fifth high-coverage resource, the first open distributional thesaurus of Russian, whose high accuracy is indicated by multiple evaluations.

GloVe: Global Vectors for Word Representation

TLDR
A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

Distributed Representations of Words and Phrases and their Compositionality

TLDR
This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
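The phrase-finding method mentioned in the TLDR scores adjacent word pairs with a discounted count ratio, score(wi, wj) = (count(wi wj) − δ) / (count(wi) × count(wj)), merging pairs that score above a threshold. A small sketch (the corpus and δ value are made-up illustrations):

```python
from collections import Counter

def phrase_scores(tokens, delta=1.0):
    """Score adjacent word pairs; high scores suggest phrases like 'new york'.
    The delta discount suppresses pairs seen only a few times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {pair: (c - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
            for pair, c in bigrams.items()}

corpus = "new york is big and new york is busy and it is old".split()
scores = phrase_scores(corpus)
```

Here `("new", "york")` gets the top score because the pair co-occurs every time either word appears, while incidental bigrams seen once are discounted to zero.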

Two/Too Simple Adaptations of Word2Vec for Syntax Problems

We present two simple modifications to the models in the popular Word2Vec tool, in order to generate embeddings more suited to tasks involving syntax. The main issue with the original models is the

*SEM 2013 shared task: Semantic Textual Similarity

TLDR
The CORE task attracted 34 participants with 89 runs, and the TYPED task attracted 6 teams with 14 runs, with relatively high inter-annotator correlation, ranging from 62% to 87%.

Dependency-Based Word Embeddings

TLDR
The skip-gram model with negative sampling introduced by Mikolov et al. is generalized to include arbitrary contexts, and experiments with dependency-based contexts are performed, showing that they produce markedly different embeddings.

Swivel: Improving Embeddings by Noticing What's Missing

We present Submatrix-wise Vector Embedding Learner (Swivel), a method for generating low-dimensional feature embeddings from a feature co-occurrence matrix. Swivel performs approximate factorization
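As a generic sketch of the underlying idea — deriving low-dimensional embeddings from a feature co-occurrence matrix — one can factorize a (log-transformed) co-occurrence matrix with a plain truncated SVD. This is not Swivel's submatrix-wise training scheme, and the tiny matrix below is a made-up example.

```python
import numpy as np

# Toy symmetric word-by-word co-occurrence counts.
cooc = np.array([[10., 4., 0.],
                 [ 4., 8., 1.],
                 [ 0., 1., 6.]])

# Factorize log(1 + counts) and keep the top-k singular directions.
U, s, Vt = np.linalg.svd(np.log1p(cooc))
k = 2
embeddings = U[:, :k] * np.sqrt(s[:k])          # rank-k word embeddings
contexts = Vt[:k].T * np.sqrt(s[:k])            # rank-k context embeddings
reconstruction = embeddings @ contexts.T        # approximates log1p(cooc)
```

Swivel's contribution is handling this factorization at scale and treating unobserved (zero-count) cells specially, which the plain SVD above does not do.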

Classifying Sentences as Speech Acts in Message Board Posts

TLDR
The goal is to create sentence classifiers that can identify whether a sentence contains a speech act, and can recognize sentences containing four different speech act classes: Commissives, Directives, Expressives, and Representatives.