A Large-Scale COVID-19 Twitter Chatter Dataset for Open Scientific Research—An International Collaboration

@article{Banda2021ALC,
  title={A Large-Scale COVID-19 Twitter Chatter Dataset for Open Scientific Research—An International Collaboration},
  author={J. Banda and Ramya Tekumalla and Guanyu Wang and Jingyuan Yu and Tuo Liu and Yuning Ding and Katya Artemova and E. Tutubalina and Gerardo Chowell},
  journal={Epidemiologia},
  year={2021}
}
As the COVID-19 pandemic continues to spread worldwide, an unprecedented amount of open data is being generated for medical, genetics, and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated on the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and… 

IRLCov19: A Large COVID-19 Multilingual Twitter Dataset of Indian Regional Languages
TLDR
A COVID-19 dataset collected between February 2020 and July 2020 specifically for regional languages in India is studied to help the Government of India, various state governments, NGOs, researchers, and policymakers in studying different issues related to the pandemic.
TBCOV: Two Billion Multilingual COVID-19 Tweets with Sentiment, Entity, Geo, and Gender Labels
TLDR
A large-scale social sensing dataset comprising two billion multilingual tweets posted from 218 countries by 87 million users in 67 languages is offered; the authors believe this multilingual data, with its broader geographical and longer temporal coverage, will be a cornerstone for researchers to study impacts of the ongoing global health catastrophe and to manage adverse consequences related to people’s health, livelihood, and social well-being.
Long-term patient-reported symptoms of COVID-19: an analysis of social media data
TLDR
This work uses a combination of natural language processing and clinician reviews to identify long-term self-reported symptoms on a set of Twitter users, and identifies latent symptoms that might be underreported in other places.
COVID-19 Vaccine Hesitancy: Analysing Twitter to Identify Barriers to Vaccination in a Low Uptake Region of the UK
TLDR
There is promising utility in using off-the-shelf NLP tools to draw insights from social media data in support of public health research; safety concerns, mistrust of government and pharmaceutical companies, and accessibility issues are identified as key barriers limiting vaccine uptake.
Revealing the linguistic and geographical disparities of public awareness to Covid-19 outbreak through social media
TLDR
Results show that users presenting the highest Covid-19 awareness were mainly those tweeting in the official languages of India and Bangladesh, and the Ratio index had high correlations with global mortality rate, global case fatality ratio, and country-level mortality rate.
Anbar: Collection and analysis of a large scale Urdu language Twitter corpus
TLDR
This paper builds and analyzes a large-scale Urdu language Twitter corpus, Anbar, which can be used for Natural Language Understanding, social analytics, and fake news detection, and examines Anbar using a variety of metrics such as temporal frequency of tweets, vocabulary size, geo-location, user characteristics, and entity distribution.
Negative Perception of the COVID-19 Pandemic Is Dropping: Evidence From Twitter Posts
TLDR
It is shown that the public's negative perception of the COVID-19 pandemic decreased sharply when the vaccination campaign started in the USA, Canada, and the UK and has continued to decrease steadily since then, leading to the conclusion that vaccination plays a key role in reducing the public's negativity, thus promoting their psychological wellbeing.
Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset
TLDR
The analyses show that COVID-19 misinformed communities are denser and more organized than informed communities, with the possibility that a high volume of the misinformation is part of disinformation campaigns.
Emotional Analysis of Twitter Posts During the First Phase of the COVID-19 Pandemic in Greece: Infoveillance Study
TLDR
Although the Greeks felt rather safe during the first phase of the COVID-19 pandemic, their positive and negative emotions reflected a masked "flight or fight" or "fear versus anger" response to the contagion.
Changes in Public Response Associated With Various COVID-19 Restrictions in Ontario, Canada: Observational Infoveillance Study Using Social Media Time Series Data
TLDR
This study demonstrates the feasibility of a rapid and flexible method of evaluating the public response to pandemic restrictions using near real-time social media data.

References

A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration
TLDR
A large-scale curated dataset of over 152 million tweets, growing daily, related to COVID-19 chatter generated from January 1st to April 4th at the time of writing, will allow researchers to conduct a number of research projects relating to the emotional and mental responses to social distancing measures and the identification of sources of misinformation.
A Google–Wikipedia–Twitter Model as a Leading Indicator of the Numbers of Coronavirus Deaths
TLDR
It is found that a model with the number of Google searches, Twitter tweets, and Wikipedia page views provides a leading indicator model of the number of people in the USA who will become infected and die from the coronavirus.
Characterization of long-term patient-reported symptoms of COVID-19: an analysis of social media data
TLDR
A longitudinal characterization of post-COVID-19 symptoms using social media data from Twitter is presented, using a combination of machine learning, natural language processing techniques, and clinician reviews to characterize the post-acute infection course of the disease.
Analysis of Twitter Data Using Evolutionary Clustering during the COVID-19 Pandemic
People started posting textual tweets on Twitter as soon as the novel coronavirus (COVID-19) emerged. Analyzing these tweets can assist institutions in better decision-making and prioritizing their
Long-term patient-reported symptoms of COVID-19: an analysis of social media data
TLDR
This work uses a combination of natural language processing and clinician reviews to identify long-term self-reported symptoms on a set of Twitter users, and identifies latent symptoms that might be underreported in other places.
Social Media Mining Toolkit (SMMT)
TLDR
The Social Media Mining Toolkit (SMMT), a suite of tools aimed to encapsulate the cumbersome details of acquiring, preprocessing, annotating and standardizing social media data, is introduced, simplifying research reproducibility and accessibility in the social media domain.
Characterization of Potential Drug Treatments for COVID-19 using Social Media Data and Machine Learning
TLDR
A large Twitter dataset of 424 million tweets of COVID-19 chatter is mined to identify discourse around potential treatments, demonstrating the need for machine learning methods to aid in this task.
Using Tweets to Understand How COVID-19–Related Health Beliefs Are Affected in the Age of Social Media: Twitter Data Analysis Study
TLDR
The number of users tweeting about COVID-19 health beliefs was amplifying in an epidemic manner and could partially intensify the infodemic, according to the classic epidemiology model.
Mining Archive.org’s Twitter Stream Grab for Pharmacovigilance Research Gold
TLDR
This work demonstrates how it mined over 9.4 billion Tweets from archive.org’s Twitter stream grab using a drug-term dictionary and plenty of computing power and used pre-existing drug-related datasets to build machine learning models to filter the findings for relevance.
Understanding the Public Discussion About the Centers for Disease Control and Prevention During the COVID-19 Pandemic Using Twitter Data: Text Mining Analysis Study
TLDR
This study identifies the topics and their overarching themes emerging from the public COVID-19-related discussion about the CDC on Twitter to provide insight into the public's concerns, focus of attention, perception of the CDC's current performance, and expectations of the CDC.