My talk “TWIMPACT: On Real-time Twitter Analysis” given at the Apache Hadoop Get Together in Berlin on April 18, 2012.
Here is a nice demo Leo put together. You see a timelapse video of the places people were talking about on Twitter during March 2011. Shown is the average activity over the last hour. On March 11, there was the huge earthquake in Japan, which dwarfs all other locations for quite some time.
For this demo, we’ve extracted place names for about 16,000 cities from OpenStreetMap data (about 500k variations all in all) and then matched these names in the tweets (i.e. we’re not using the geolocation, but get the locations from the tweet texts themselves). The resulting stream is run through our analysis database to compute the location trends online.
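To make the matching step concrete, here is a minimal sketch in Python. It is not our actual code: the tiny GAZETTEER table, the function names, and the greedy longest-match strategy are all assumptions chosen for illustration. The real table has about 500k name variations.

```python
import re
from collections import Counter

# Hypothetical miniature gazetteer: lowercase name variation -> canonical city.
GAZETTEER = {
    "tokyo": "Tokyo",
    "tokio": "Tokyo",
    "new york": "New York",
    "nyc": "New York",
    "berlin": "Berlin",
}
# Longest variation, measured in words, bounds the n-gram window.
MAX_WORDS = max(len(name.split()) for name in GAZETTEER)


def find_places(text):
    """Return canonical cities mentioned in text, preferring longer matches."""
    words = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in GAZETTEER:
                found.append(GAZETTEER[candidate])
                i += n
                break
        else:
            i += 1  # no place name starts at this word
    return found


def count_places(tweets):
    """Count how many tweets mention each city (once per tweet)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(set(find_places(tweet)))
    return counts
```

For example, find_places("Earthquake in Tokio, thoughts from New York") yields Tokyo and New York. A dictionary lookup per n-gram keeps the cost per tweet independent of the size of the name table, which matters when matching against a live stream.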
We’re currently putting together a real-time version of this for our website.
Mikio has created a data science stack on Delicious (basically a link collection). We’ll try to add data science related articles there.
You might have stumbled upon the paper “Can Tweets Predict Citations? Metrics of Social Impact Based on Twitter and Correlation with Traditional Metrics of Scientific Impact” by Gunther Eysenbach, in which a measure called “twimpact” is proposed. Unfortunately, this work has nothing to do with us, and it seems his paper also contains some methodological flaws. One can only wonder how he wasn’t aware of the “other” twimpact; a quick search on Twitter would have been enough to reveal that the name already exists…
So here is a bit of background information on the data processing we’re doing for our NIPS demo. We’re currently reanalyzing the retweet trends for all of 2010. We cannot afford the firehose (but really, who can?), but the normal stream API gives more than enough data. It seems to be capped at about 50 tweets per second, but this still gives about 4.3 million tweets per day. The sampling seems to be quite reliable as well, meaning that we get a pretty representative sample capturing all the important trends.
For the analysis, we’re keeping a “hot” set of the 300,000 most active retweets in memory. From that we also compute trends for user mentions, hashtags, links, and our TWIMPACT impact score. We also keep graph data of which user has retweeted whom and which user has retweeted which tweet. We’re bounding the number of edges in those graphs as well by continuously discarding old links, resulting in about 550,000 edges in the user-retweets-user graph and about 15 million edges in the user-retweets-tweet graph.
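One simple way to bound a graph like this is to evict the edge that has gone unobserved the longest whenever a capacity is exceeded. The sketch below shows that idea in Python. The class name BoundedEdgeSet and the eviction-by-last-seen policy are assumptions for illustration, not a description of our implementation.

```python
from collections import OrderedDict


class BoundedEdgeSet:
    """Keep at most max_edges edges, discarding the least recently seen.

    Assumed policy: an edge is "old" if it has not been observed for the
    longest time. The real system may use a different criterion.
    """

    def __init__(self, max_edges):
        self.max_edges = max_edges
        self._edges = OrderedDict()  # (src, dst) -> number of observations

    def add(self, src, dst):
        key = (src, dst)
        self._edges[key] = self._edges.get(key, 0) + 1
        self._edges.move_to_end(key)  # mark as most recently seen
        while len(self._edges) > self.max_edges:
            self._edges.popitem(last=False)  # drop the oldest edge

    def count(self, src, dst):
        return self._edges.get((src, dst), 0)

    def __contains__(self, key):
        return key in self._edges

    def __len__(self):
        return len(self._edges)
```

With max_edges set to 2, adding (a, b), (c, d), (a, b) again and then (e, f) evicts (c, d), because (a, b) was refreshed in between. Both insertion and eviction are constant-time, so memory stays bounded no matter how long the stream runs.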
All this data can be kept in about 6-8GB of memory. Every 8 “data hours”, we write a snapshot of the data to disk in a custom format that includes indices, which allows for quick access even without loading the snapshot into memory. Each file is about 1.5GB in size.
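To illustrate how an index inside the file allows access without loading everything, here is a toy snapshot layout in Python. It is not our custom format: the layout (JSON records, then a JSON index, then an 8-byte footer pointing at the index) and the function names are assumptions made up for this sketch, and keys are assumed to be strings.

```python
import json
import struct


def write_snapshot(path, records):
    """Write {key: value} as JSON blobs, then an offset index, then a footer."""
    index = {}
    with open(path, "wb") as f:
        for key, value in records.items():
            blob = json.dumps(value).encode("utf-8")
            index[key] = (f.tell(), len(blob))  # where this record lives
            f.write(blob)
        index_offset = f.tell()
        f.write(json.dumps(index).encode("utf-8"))
        # Footer: 8-byte little-endian offset of the index.
        f.write(struct.pack("<Q", index_offset))


def read_index(path):
    """Load only the index by following the footer at the end of the file."""
    with open(path, "rb") as f:
        f.seek(-8, 2)  # 8 bytes before the end of the file
        footer_pos = f.tell()
        (index_offset,) = struct.unpack("<Q", f.read(8))
        f.seek(index_offset)
        return json.loads(f.read(footer_pos - index_offset).decode("utf-8"))


def read_record(path, index, key):
    """Fetch a single record with one seek, without reading the rest."""
    offset, length = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length).decode("utf-8"))
```

Reading one record then costs one small read for the index plus one seek, regardless of how large the snapshot file is.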
From January to November, we have about 335 days. So far we’ve analyzed about 1.3 billion tweets at a stable rate of about 2000 tweets per second, without any serious attempt at multithreading. In the end, we expect to bring about 1.5TB of pre-analyzed data to NIPS, which you can then explore at the demo.
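The numbers above fit together, as a quick back-of-the-envelope check in Python shows. It assumes that a “data hour” is one hour of tweet time (so three snapshots per day) and that 1TB is 1000GB. The function names are just for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def tweets_per_day(stream_rate):
    """Tweets collected per day at a given stream rate (tweets/second)."""
    return stream_rate * SECONDS_PER_DAY


def snapshot_volume_tb(days, hours_per_snapshot, snapshot_gb):
    """Total snapshot size in TB, assuming one snapshot per interval."""
    snapshots_per_day = 24 // hours_per_snapshot
    return days * snapshots_per_day * snapshot_gb / 1000


def speedup(analysis_rate, stream_rate):
    """How much faster than real time the reanalysis runs."""
    return analysis_rate / stream_rate


print(tweets_per_day(50))               # 4320000, i.e. about 4.3 million
print(snapshot_volume_tb(335, 8, 1.5))  # 1.5075, i.e. about 1.5TB
print(speedup(2000, 50))                # 40.0
```

At 50 tweets per second, 335 days can hold at most about 1.45 billion tweets, which is consistent with the 1.3 billion analyzed so far. Processing at 2000 tweets per second means the replay runs about 40 times faster than the stream was recorded.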
The first stage of our analysis of 2011 is coming to an end shortly. So far, we’ve analyzed about 1.3 billion tweets…