• Corpus ID: 13560202

Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose

@article{Morstatter2013IsTS,
  title={Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose},
  author={Fred Morstatter and J{\"u}rgen Pfeffer and Huan Liu and Kathleen M. Carley},
  journal={ArXiv},
  year={2013},
  volume={abs/1306.5204}
}
Twitter is a social media giant famous for the exchange of short, 140-character messages called "tweets". In the scientific community, the microblogging site is known for openness in sharing its data. It provides a glance into its millions of users and billions of tweets through a "Streaming API" which provides a sample of all tweets matching some parameters preset by the API user. The API service has been used by many researchers, companies, and governmental institutions that want to extract… 


Should We Use the Sample? Analyzing Datasets Sampled from Twitter’s Stream API
TLDR
A comparative analysis on samples obtained from two of Twitter’s streaming APIs with a more complete Twitter dataset is performed to gain an in-depth understanding of the nature of Twitter data samples and their potential for use in various data mining tasks.
The Story of Goldilocks and Three Twitter’s APIs: A Pilot Study on Twitter Data Sources and Disclosure
TLDR
This study examines whether tweets collected using the same search filters over the same time period, but calling different APIs, would yield comparable datasets; tweets about anti-smoking, e-cigarettes, and tobacco were retrieved using the aforementioned APIs.
Tampering with Twitter’s Sample API
TLDR
It is demonstrated that, due to the nature of Twitter’s sampling mechanism, it is possible to deliberately influence these samples, the extent and content of any topic, and consequently to manipulate the analyses of researchers, journalists, as well as market and political analysts trusting these data sources.
The Tweets They Are a-Changin: Evolution of Twitter Users and Behavior
TLDR
Using a set of over 37 billion tweets spanning over seven years, a number of trends are observed, including the spread of Twitter across the globe, the rise of spam and malicious behavior, the rapid adoption of tweeting conventions, and the shift from desktop to mobile usage.
On the endogenesis of Twitter's Spritzer and Gardenhose sample streams
TLDR
Evidence is found for the method used by Twitter to decide which tweets will show up in the random sample streams, and an overview is provided of how Twitter's unique tweet IDs are generated, explaining the regularities of each part of a tweet ID.
Impact of Twitter on human interaction
The aim of this research is to examine whether Twitter impacts its users by using a data science approach. A comparison between the data collection using Twitter Search API via RStudio and NodeXL has…
Surpassing the Limit: Keyword Clustering to Improve Twitter Sample Coverage
TLDR
An initial look is taken at the efficiency of Twitter limit track as a sample population estimator and methods to mitigate bias by improving sample population coverage using clustering techniques are provided.
Sampling Content from Online Social Networks
TLDR
This paper analyzes a different sampling methodology, one where content is gathered only from a relatively small sample (<1%) of the user population, namely, the expert users, and gathers tweets from over 500,000 Twitter users identified as experts on a diverse set of topics.
Is Data Collection through Twitter Streaming API Useful for Academic Research?
TLDR
The experiments showed that when filtering is used for terms that are not very popular, all the matching Tweets are likely provided by Twitter; in this case, analyzing those Tweets will provide reliable results for research purposes. Concurrent processes that collect filtered Tweets for very popular terms tend to return almost the same Tweets.
