Some numbers on our NIPS demo
So here is a bit of background information on the data processing we’re doing for our NIPS demo. We’re currently reanalyzing the retweet trends for all of 2010. We cannot afford the firehose (but really, who can?), but the normal streaming API gives more than enough data. It seems to be capped at about 50 tweets per second, which still amounts to about 4.3 million tweets per day. The sampling also seems to be quite reliable, meaning that we get a representative sample that captures all the important trends.
For the analysis, we’re keeping a “hot” set of the 300,000 most active retweets in memory. From that we also compute trends for user mentions, hashtags, links, and our TWIMPACT impact score. We also keep graph data of which user has retweeted whom and which user has retweeted which tweet. We’re bounding the number of edges in those graphs as well by continuously discarding old links, resulting in about 550,000 edges in the user-retweets-user graph, and about 15 million edges in the user-retweets-tweet graph.
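To illustrate the bounding idea, here is a minimal sketch (not our actual implementation) of a size-capped edge set that discards the least-recently-touched edge once a limit is reached, using Java's LinkedHashMap in access order. The class name and cap are made up for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a size-bounded edge set that evicts the
// least-recently-touched edge once the cap is reached, roughly the
// kind of bounding described for the retweet graphs.
public class BoundedEdgeSet {
    private final int maxEdges;
    private final Map<String, Long> edges;

    public BoundedEdgeSet(final int maxEdges) {
        this.maxEdges = maxEdges;
        // accessOrder = true makes the map behave like an LRU structure.
        this.edges = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > BoundedEdgeSet.this.maxEdges;
            }
        };
    }

    // Record (or refresh) an edge such as "userA->userB".
    public void touch(String edge, long timestamp) {
        edges.put(edge, timestamp);
    }

    public int size() {
        return edges.size();
    }

    public static void main(String[] args) {
        BoundedEdgeSet graph = new BoundedEdgeSet(2);
        graph.touch("a->b", 1L);
        graph.touch("a->c", 2L);
        graph.touch("a->d", 3L); // evicts the oldest edge, a->b
        System.out.println(graph.size()); // prints 2
    }
}
```

In practice one would probably age edges by timestamp rather than by pure access order, but the effect is the same: the memory footprint of the graph stays bounded no matter how much data streams in.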
All this data can be kept in about 6-8GB of memory. Every 8 “data hours” we write a snapshot of the data to disk in a custom format that includes indices, which allows for quick access to individual records even without loading the whole snapshot into memory. Each file is about 1.5GB in size.
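The indexing idea can be sketched roughly like this (an assumption about the general technique, not our actual file format): records are appended to a file while an index maps each key to its byte offset, so a single record can later be read by seeking directly to it.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an indexed snapshot: append records to a
// file, remember each record's byte offset, then read one record
// back without scanning the whole file.
public class SnapshotSketch {
    public static void main(String[] args) throws IOException {
        Map<String, Long> index = new HashMap<String, Long>();
        RandomAccessFile file = new RandomAccessFile("snapshot.bin", "rw");
        try {
            // Write two records, remembering where each one starts.
            index.put("tweet:42", file.getFilePointer());
            file.writeUTF("retweet counts for tweet 42");
            index.put("tweet:99", file.getFilePointer());
            file.writeUTF("retweet counts for tweet 99");

            // Random access: seek directly to one record via the index.
            file.seek(index.get("tweet:99"));
            System.out.println(file.readUTF()); // prints retweet counts for tweet 99
        } finally {
            file.close();
        }
    }
}
```

In a real snapshot the index itself would also be written to disk (so it can be reloaded without rebuilding it), but the seek-by-offset trick is what makes access fast without holding the snapshot in memory.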
From January to November, that is about 335 days of data. So far we’ve analyzed about 1.3 billion tweets at a stable rate of about 2,000 tweets per second, without any serious attempt at multithreading. In the end, we expect to bring about 1.5TB of pre-analyzed data to NIPS, which you can then explore at the demo.
Some insights from hunting for memory leaks
One of the main design decisions in our current approach to analyzing retweet activity on Twitter data is to keep all the “hot” data in memory while simultaneously bounding the amount of data we are willing to keep. This makes sense because only a tiny fraction of tweets are ever retweeted more than once, and you have to bound the amount of “live” data somehow to ensure that performance stays stable.
Now, while re-analyzing the data for this year’s NIPS demo, we observed that memory was gradually filling up after a few weeks of analyzed data. So we went in to have a closer look.
The first thing we saw was that we had a lot of the original JSON strings in memory, even though we often only refer to a small substring (say, the name of a user somewhere in a tweet). It turns out that the main reason for this is that Java tries to be clever with substrings (and also with matches from regexes) and implements them as a restricted view of the original string, without copying the data. This is fine in terms of speed most of the time, but a problem when you extract only small bits of the data and actually want to discard the rest after you’re done. Luckily, the solution is simple: call “new String()” on the substring, which actually copies the data.
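A minimal illustration of the fix (note that this describes the String implementation of the JVMs of the time; the substring-sharing behavior was later removed in Java 7 update 6, after which substring() copies by itself):

```java
// Demonstrates copying a substring so the large original string can
// be garbage-collected. On older JVMs, substring() returned a view
// that shared the parent string's entire char array.
public class SubstringCopy {
    public static void main(String[] args) {
        // Stand-in for a large raw JSON tweet.
        StringBuilder padding = new StringBuilder();
        for (int i = 0; i < 100000; i++) padding.append('x');
        String rawJson = "{\"user\":\"alice\"}" + padding;

        String view = rawJson.substring(9, 14);             // may pin rawJson's buffer
        String copy = new String(rawJson.substring(9, 14)); // independent copy

        System.out.println(view); // prints alice
        System.out.println(copy); // prints alice
    }
}
```

Both strings compare equal, but only the copy lets the multi-kilobyte original be reclaimed once it goes out of scope.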
This alone reduced the pressure on memory significantly.
Eventually we figured out that there was one data structure whose growth was not explicitly bounded: the graph of users who have retweeted a tweet. The graph was implicitly bounded, since retweets are eventually removed once they become old enough, but because some retweets have been retweeted more than a hundred thousand times (and I have no idea what one of them means since it’s in Indonesian), the graph had grown to more than twenty million edges.
So finally we came up with a strategy that continuously ages edges in the graph, bounding its overall growth to fifteen million edges. Now, at last, everything is running as stably as we want, with about 10-15GB of live data.
Three weeks to go for our NIPS demo
We’re preparing a reanalysis of all of our data from 2011 to bring to Granada. The reanalysis works in two phases: First, retweets are analyzed sequentially for the whole year. This cannot be parallelized well, as you need to know what has happened so far to match retweets correctly (we’re also matching retweets which are not generated by Twitter but by people using the “RT” convention). In a second sweep, we post-analyze the data to compute trends for links, hashtags, etc.
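Matching manual “RT” retweets might look roughly like the following sketch (the class name, regex, and return convention are assumptions for illustration, not our actual matcher):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of detecting "organic" retweets that follow the
// manual "RT @user: ..." convention instead of Twitter's retweet API.
public class RtMatcher {
    // Matches e.g. "RT @alice: original text".
    private static final Pattern RT = Pattern.compile("^RT @(\\w+):?\\s*(.*)$");

    // Returns {originalAuthor, originalText}, or null if not an RT.
    public static String[] match(String tweetText) {
        Matcher m = RT.matcher(tweetText);
        if (!m.matches()) return null;
        // Copy the groups so we don't keep the whole tweet text alive
        // (see the substring issue discussed in the previous post).
        return new String[] { new String(m.group(1)), new String(m.group(2)) };
    }

    public static void main(String[] args) {
        String[] hit = match("RT @alice: check out our NIPS demo!");
        System.out.println(hit[0]); // prints alice
        System.out.println(hit[1]); // prints check out our NIPS demo!
    }
}
```

Real tweets are messier (nested RTs, truncation, varied punctuation), which is part of why this matching has to run sequentially over the full history.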
Current status: We’re about halfway through 2011 with the pre-analysis and have prepared the post-analysis.
We have been accepted for the Demonstration track at NIPS 2011. We plan to do a real-time demo as well as bring a whole year’s worth of history with us, so that you can go back in time and relive the key events of last year on Twitter.
Looking forward to seeing you in Granada!