Is the Sample Good Enough? Comparing Twitter’s Streaming API with Twitter’s Firehose

In the past seven years, Twitter has grown into a social media giant. The service now facilitates the exchange of over 400 million “tweets”, or short 140-character messages, per day. The massive scale of this (mostly) publicly exchanged communication data offers a tremendous opportunity to researchers, companies, and governmental institutions.

Access to the “Firehose”, a feed provided by Twitter that carries 100% of the publicly available tweets, can be both difficult to manage and prohibitively expensive. As a result, many researchers rely on Twitter’s “Streaming API”, which provides a sample of all tweets matching parameters preset by the API user. This API, however, suffers from a key drawback: there is little documentation of how much, or what kind of, data researchers will receive. In this paper, we aim to answer the question: To what extent is the sampled data offered through the Streaming API a valid representation of the overall activity on Twitter?

Tag Clouds
Our analysis focused on tweets collected in the region around Syria between December 14, 2011 and January 10, 2012. Here, we compare the frequencies of the top terms appearing in tweet samples collected via (a) the Firehose and (b) the Streaming API.

In our analysis, we used a variety of common statistical measures to compare aspects of the data collected by the Streaming API with that collected by a random sample from the Twitter Firehose. Our findings include the following:

  1. When estimating the top n hashtags in a dataset, the Streaming API data may be misleading when n is small (the estimate improves as n increases).
  2. Similarly, the topical distribution of tweets collected through the Streaming API becomes more representative as the amount of data collected increases.
  3. Analyzing retweet networks, we showed that networks built from a single day of Streaming API data identify, on average, 50–60% of the top 100 “key players”.
  4. Surprisingly, the Streaming API returns almost the complete set of geotagged tweets, despite sampling (shown below).
Here we see the similarity in the distributions of geotagged tweets from the Firehose and the Streaming API: the Streaming API captured over 90% of the geotagged tweets.
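Finding (1) above — that top-n hashtag estimates are unreliable for small n — can be illustrated with a small sketch. The data structures here are our own assumptions (each tweet is represented as a dict with a `hashtags` list), not the paper’s actual pipeline:

```python
from collections import Counter

def top_n_hashtag_overlap(sample_a, sample_b, n):
    """Fraction of the top-n hashtags shared by two tweet samples
    (1.0 means the two top-n lists contain exactly the same tags)."""
    def top_tags(tweets):
        counts = Counter(tag.lower() for tw in tweets for tag in tw["hashtags"])
        return {tag for tag, _ in counts.most_common(n)}
    return len(top_tags(sample_a) & top_tags(sample_b)) / n
```

With toy data where the two samples rank the same tags slightly differently, the top-1 lists can disagree completely while the top-2 and top-3 lists agree perfectly — mirroring the observation that the estimate improves as n increases.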

To verify our results, we compared our Streaming API dataset against 100 synthetic Streaming API datasets created by sampling the Firehose uniformly at random. This comparison shows that the bias introduced by the Streaming API is statistically significant.
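The idea behind this comparison can be sketched as follows (the function name, uniform-sampling assumption, and metric interface are ours, not the paper’s exact procedure): draw many uniform random samples of the same size from the Firehose, compute a metric of interest on each, and see where the Streaming API’s observed value falls in that distribution.

```python
import random

def bias_rank(firehose, api_value, metric, sample_size, trials=100, seed=42):
    """Fraction of `trials` uniform random Firehose samples whose metric
    is <= the Streaming API's observed value. Values near 0 or 1 suggest
    the Streaming API sample is unlikely to be a uniform random draw."""
    rng = random.Random(seed)
    draws = [metric(rng.sample(firehose, sample_size)) for _ in range(trials)]
    return sum(d <= api_value for d in draws) / trials
```

If the observed value sits in the middle of the synthetic distribution, the Streaming API sample behaves like a random one; if it sits in an extreme tail, the sampling bias is significant.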

Want to learn more? Read our full paper, Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose, accepted at ICWSM 2013.

Fred Morstatter, Arizona State University
Jürgen Pfeffer, Carnegie Mellon University
Huan Liu, Arizona State University
Kathleen M. Carley, Carnegie Mellon University

About the author


Graduate student at Arizona State University. Data Mining, Machine Learning, Social Media Mining, Social Network Analysis. Advisor: Professor Huan Liu.
