Timely location of useful information during crises is critical for those forced to make life-altering decisions. To stay informed, emergency responders and affected individuals increasingly rely on social media platforms, and on Twitter in particular. To obtain relevant information, they typically use one of two main strategies to query Twitter:
- Keyword-based sampling: Track the tweets that contain a set of manually identified keywords or hashtags specific to a crisis, such as #sandy for Hurricane Sandy or #bostonbombings. Yet, keywords are only as responsive as the humans curating them and, indeed, in our data, such searches returned only a fraction of the relevant tweets—only 18% to 45% of the crisis-relevant tweets were retrieved, with an average of ~33%.
- Geo-based sampling: Track the tweets that are geo-tagged in the area of the disaster. However, only a small percentage of the tweets returned this way are actually about the disaster: only 6% to 26% of the returned tweets are crisis-relevant, with an average of ~12.5%.
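The two percentages above measure different things: the keyword figure is a recall (the share of all relevant tweets that were retrieved), while the geo figure is a precision (the share of retrieved tweets that were relevant). The following is a minimal sketch of both measures on made-up tweet IDs. The function name `precision_recall` and the toy samples are illustrative assumptions, not part of the CrisisLex code or data.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for a collection of retrieved tweet IDs."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy ground truth (hypothetical): tweets 1..8 are crisis-relevant.
relevant = range(1, 9)

# A keyword sample tends to be small and mostly on topic.
keyword_sample = [1, 2, 3, 20]

# A geo sample tends to be large and mostly off topic.
geo_sample = [1, 2, 3, 4, 5, 6] + list(range(100, 134))

kw_precision, kw_recall = precision_recall(keyword_sample, relevant)
geo_precision, geo_recall = precision_recall(geo_sample, relevant)

print(f"keywords: precision={kw_precision:.2f} recall={kw_recall:.2f}")
print(f"geo:      precision={geo_precision:.2f} recall={geo_recall:.2f}")
```

In this toy setup the keyword sample is clean but misses most relevant tweets, while the geo sample catches more of them at the cost of a great deal of noise.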
Efficiently collecting crisis-relevant information from Twitter is challenging because of the laconic language of tweets and the limitations of Twitter’s API for accessing tweets in real time (the streaming API). Twitter can be queried by content, through the use of up to 400 keywords, or by geo-location. If both keywords and geo-locations are given, the query is interpreted as a disjunction (logical OR) of the two. This is undesirable because the public API gives access to only 1% of the data: if the query matches more data than that, the API returns a random sample of the matches. Thus, as the query becomes broader, at some point we start losing data.
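To make the effect of the disjunction and the 1% cap concrete, here is a small simulation. It is a simplified model under stated assumptions, not Twitter's actual implementation: `matches_query`, `simulate_stream`, the tweet dictionaries, and the bounding box are all hypothetical, and the list `tweets` stands in for the full public stream.

```python
import random


def matches_query(tweet, keywords=(), bbox=None):
    """True if the tweet contains ANY keyword OR lies inside the bounding box."""
    text = tweet["text"].lower()
    keyword_hit = any(k.lower() in text for k in keywords)
    geo_hit = False
    if bbox is not None and tweet.get("coords") is not None:
        lon, lat = tweet["coords"]
        west, south, east, north = bbox
        geo_hit = west <= lon <= east and south <= lat <= north
    return keyword_hit or geo_hit  # disjunction, as described above


def simulate_stream(tweets, keywords=(), bbox=None, cap_fraction=0.01, seed=0):
    """Return the matches, randomly down-sampled if they exceed the cap."""
    matched = [t for t in tweets if matches_query(t, keywords, bbox)]
    cap = int(len(tweets) * cap_fraction)
    if len(matched) > cap:
        matched = random.Random(seed).sample(matched, cap)
    return matched


# Illustrative bounding box as (west, south, east, north).
NYC_BBOX = (-74.3, 40.5, -73.7, 40.9)

# 1000 toy tweets: 5 relevant, 30 geo-tagged but off topic, 965 unrelated.
tweets = (
    [{"text": "Stay safe everyone #sandy", "coords": None}] * 5
    + [{"text": "Coffee time", "coords": (-74.0, 40.7)}] * 30
    + [{"text": "Unrelated chatter", "coords": None}] * 965
)

narrow = simulate_stream(tweets, keywords=["#sandy"])
broad = simulate_stream(tweets, keywords=["#sandy"], bbox=NYC_BBOX)

print(len(narrow), "tweets from the keyword-only query")
print(len(broad), "tweets from the keyword OR geo query")
```

With the keyword-only query, all 5 matching tweets fit under the cap of 10. Adding the bounding box raises the number of matches to 35, so the simulated API returns a random 10 of them and some relevant tweets may be dropped.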
To overcome these limitations, we built CrisisLex—a lexicon of terms that frequently appear in tweets posted during a variety of crises. By querying Twitter using CrisisLex, we obtain better trade-offs between how much relevant data we retrieve and how clean that data is. The lexicon contains terms such as:
- affected people
- people displaced
- donate blood
- text redcross
- stay safe
- crisis deepens
- toll raises
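As an illustration of how such terms could be used to filter tweets offline, the sketch below checks which lexicon terms a tweet matches. It assumes that a multi-word term matches when all of its words occur in the tweet, in any order. The names `LEXICON`, `tokenize`, and `matching_terms` are hypothetical, and the list holds only the sample terms shown above, not the full lexicon.

```python
import re

# Sample terms copied from the list above (not the full lexicon).
LEXICON = [
    "affected people",
    "people displaced",
    "donate blood",
    "text redcross",
    "stay safe",
    "crisis deepens",
    "toll raises",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_terms(tweet_text, lexicon=LEXICON):
    """Return the lexicon terms whose words all appear in the tweet."""
    tokens = tokenize(tweet_text)
    return [term for term in lexicon if set(term.split()) <= tokens]


print(matching_terms("Please stay safe and donate blood if you can"))
print(matching_terms("Great coffee this morning"))
```

The first call matches "donate blood" and "stay safe"; the second matches nothing. A tweet can then be kept whenever `matching_terms` returns a non-empty list.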
CrisisLex has two main applications:
- Increase the recall in the sampling of crisis-related messages (particularly at the start of the event), without incurring a significant loss in terms of precision.
- Automatically learn the terms used to describe a new crisis and adapt the query with them.
Consequently, CrisisLex requires no manual intervention to define or adapt the query. This is particularly useful because manually identifying keywords takes time, and that latency may result in lost tweets. In addition, CrisisLex not only retrieves more comprehensive sets of crisis-relevant tweets, but also helps to preserve the original distribution of message types and message sources.
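One simple way to picture automatic query adaptation is a feedback loop: collect tweets with the current query, find hashtags that recur in them, and add the most frequent ones to the query. The sketch below illustrates that general idea only and should not be read as the procedure CrisisLex implements. The function `expand_query`, its parameters, and the sample tweets are assumptions made for this example.

```python
import re
from collections import Counter


def expand_query(query_terms, matched_tweets, top_k=2, min_count=2):
    """Add the most frequent new hashtags in matched tweets to the query."""
    known = {term.lower() for term in query_terms}
    counts = Counter()
    for text in matched_tweets:
        # Count each hashtag at most once per tweet.
        counts.update(set(re.findall(r"#\w+", text.lower())))
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    new_terms = [
        tag for tag, n in ranked if n >= min_count and tag not in known
    ]
    return list(query_terms) + new_terms[:top_k]


# Hypothetical tweets retrieved by the lexicon term "stay safe".
matched = [
    "Stay safe out there #sandy #nyc",
    "Stay safe, power is out #sandy",
    "Everyone stay safe #sandy #nyc",
    "stay safe #frankenstorm",
]

adapted_query = expand_query(["stay safe"], matched)
print(adapted_query)
```

Here `#sandy` appears in three tweets and `#nyc` in two, so both are added, while `#frankenstorm` falls below `min_count`. A tag as generic as `#nyc` would likely hurt precision, which is why a practical system would need a stricter selection criterion than raw frequency.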
For more detailed results on how we built and tested CrisisLex, please see our paper: CrisisLex: A Lexicon for Collecting and Filtering Microblogged Communications in Crises. If you want to use CrisisLex to collect tweets, or to build your own lexicon for other domains (e.g., health, politics, sports), please see our code and data (in accordance with the terms of service of Twitter’s API) at CrisisLex.org.