Corpus ID: 53467656

Comparison of Word Embeddings and Sentence Encodings as Generalized Representations for Crisis Tweet Classification Tasks

Author: Hongmin Li
Many machine learning and natural language processing approaches, including supervised and domain adaptation algorithms, have been proposed and studied in the context of filtering crisis tweets. However, the application of these approaches in practice is still challenging due to the time-critical requirements of emergency response operations, and also to the diversity and unique characteristics of emergency events. To address this limitation, we explore the idea of building “generalized… 

Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

A comprehensive evaluation of data pre-processing and word embedding techniques for Arabic document classification in the domain of health-related communication on social media finds that only four of the 26 pre-processing techniques improve classification accuracy significantly.

Classification of Incident-related Tweets: Exploiting Word and Sentence Embeddings

This paper uses recently proposed, publicly available word and sentence embeddings and deep neural network models to classify crisis-related tweets into a variable set of information classes and to provide an importance score.

Fighting Fake News Using Deep Learning: Pre-trained Word Embeddings and the Embedding Layer Investigated

This work prepares a dataset from a scrape of 13 years of continuous data, which the authors believe will narrow the gap in early detection of fake news, and proposes a deep learning model for early detection using convolutional neural networks and long short-term memory networks.

Pretrained Transformer Language Models Versus Pretrained Word Embeddings for the Detection of Accurate Health Information on Arabic Social Media: Comparative Study

This study aims to train and compare the performance of multiple deep learning models that use pretrained word embeddings and transformer language models, and indicates that the pretrained language model AraBERTv0.2 is the best model for classifying tweets as carrying either inaccurate or accurate health information.

A simple method for domain adaptation of sentence embeddings

This paper presents a simple universal method for finetuning Google's Universal Sentence Encoder (USE) using a Siamese architecture and presents an embedding finetuned on all data sets in parallel.

Information Overload in Crisis Management: Bilingual Evaluation of Embedding Models for Clustering Social Media Posts in Emergencies

An embedding-based clustering approach and a method for the automated labelling of clusters are proposed and it is found that some embeddings were not able to perform as well on a German dataset as on an English dataset.

Predicting Themes within Complex Unstructured Texts: A Case Study on Safeguarding Reports

This paper focuses on the problem of automatically identifying the main themes in a safeguarding report using supervised classification approaches and shows the potential of deep learning models to simulate subject-expert behaviour even for complex tasks with limited labelled data.

Efficient Turkish tweet classification system for crisis response

The first ever Turkish tweet dataset for crisis response is presented, which can be widely used and improve similar studies, and vector space model techniques were studied to find out the most suitable technique that can be used for the Turkish language.

A Comprehensive Comparison of Word Embeddings in Event & Entity Coreference Resolution

Coreference Resolution is an important NLP task and most state-of-the-art methods rely on word embeddings for word representation. However, one issue that has been largely overlooked in literature is



Representation learning for very short texts using weighted word embedding aggregation

Using word embeddings in Twitter election classification

This paper investigates the impact of the background dataset used to train the embedding models, as well as the parameters of the word embedding training process, namely the context window size, the dimensionality and the number of negative samples, on the attained classification performance and finds that large context window and dimension sizes are preferable to improve the performance.

Cross-Language Domain Adaptation for Classifying Crisis-Related Short Messages

Labels from past events are shown to be useful when both source and target events are of the same type (e.g. both earthquakes). Cross-language domain adaptation was also useful, but performance decreased when the languages were different.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

It is shown how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks.

Applications of Online Deep Learning for Crisis Response Using Social Media Information

A new online algorithm based on stochastic gradient descent is proposed to train DNNs in an online fashion during disaster situations to address two types of information needs of response organizations: identifying informative tweets and classifying them into topical classes.

Universal Sentence Encoder

It is found that transfer learning using sentence embeddings tends to outperform word-level transfer, and that it achieves surprisingly good performance with minimal amounts of supervised training data for a transfer task.

Rapid Classification of Crisis-Related Data on Social Networks using Convolutional Neural Networks

This work introduces neural-network-based classification methods for binary and multi-class tweet classification tasks and shows that these models do not require any feature engineering and perform better than state-of-the-art methods.

Semantic Abstraction for generalization of tweet classification: An evaluation of incident-related tweets

Semantic Abstraction is presented to improve the generalization of tweet classification. Features derived from Linked Open Data, together with location and temporal mentions, are shown to be valuable means for improving generalization.

Evaluating Word Embeddings Using a Representative Suite of Practical Tasks

This work proposes evaluating word embeddings in vivo on a suite of popular downstream tasks, using simple models with few tuned hyperparameters.