Some numbers on our NIPS demo

Here is a bit of background on the data processing we’re doing for our NIPS demo. We’re currently reanalyzing the retweet trends for all of 2010. We cannot afford the firehose (but really, who can?), but the normal stream API gives more than enough data. It seems to be capped at about 50 tweets per second, which still comes to about 4.3 million tweets per day. The sampling seems to be quite reliable as well, meaning that we get a fairly representative sample that captures all the important trends.
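As a quick sanity check on that daily figure, here is the arithmetic as a small Python sketch. The function name is just for illustration, and the 50 tweets per second is the approximate cap we observe, not a documented limit.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def tweets_per_day(tweets_per_second: float) -> float:
    """Daily volume if the stream ran at the given rate all day."""
    return tweets_per_second * SECONDS_PER_DAY


daily = tweets_per_day(50)  # 50/s is the observed, approximate cap
print(f"{daily / 1e6:.2f} million tweets per day")
```

Since the stream does not run at the cap all the time, the real daily volume sits somewhat below this upper bound.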

For the analysis, we’re keeping a “hot” set of the 300,000 most active retweets in memory. From that set we also compute trends for user mentions, hashtags, links, and our TWIMPACT impact score. We also keep graph data recording which user has retweeted whom and which user has retweeted which tweet. We bound the number of edges in those graphs as well by continuously discarding old links, which results in about 550,000 edges in the user-retweets-user graph and about 15 million edges in the user-retweets-tweet graph.
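To illustrate the idea of bounding a graph by discarding old links, here is a minimal Python sketch. It is not our actual implementation: the class name `BoundedEdgeSet` is made up for this example, and the policy shown (drop the least recently seen edge once a fixed capacity is exceeded) is only one plausible way to do it.

```python
from collections import OrderedDict


class BoundedEdgeSet:
    """Holds at most `capacity` edges; the least recently seen edge is dropped first.

    Hypothetical helper for illustration only.
    """

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._edges = OrderedDict()  # (source, target) -> number of times seen

    def add(self, source, target):
        key = (source, target)
        self._edges[key] = self._edges.get(key, 0) + 1
        self._edges.move_to_end(key)  # mark as most recently seen
        while len(self._edges) > self.capacity:
            self._edges.popitem(last=False)  # discard the oldest edge

    def count(self, source, target) -> int:
        return self._edges.get((source, target), 0)

    def __contains__(self, key) -> bool:
        return key in self._edges

    def __len__(self) -> int:
        return len(self._edges)


graph = BoundedEdgeSet(capacity=2)
graph.add("alice", "bob")
graph.add("carol", "bob")
graph.add("alice", "bob")   # seen again, so alice -> bob becomes the newest edge
graph.add("dave", "carol")  # over capacity: carol -> bob is the oldest and is dropped
```

With a fixed capacity, memory use stays flat no matter how long the stream runs, at the price of forgetting edges that have not been active recently.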

All this data fits in about 6-8GB of memory. Every 8 “data hours” we write a snapshot of the data to disk in a custom format that includes indices, which allows for quick access even without loading the snapshot into memory. Each snapshot file is about 1.5GB in size.
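The general idea of an indexed snapshot can be sketched in a few lines of Python. This is not our actual file format: the layout, the JSON encoding, and the function names below are assumptions made up for illustration. The point is only that an index of byte offsets lets you fetch a single record with a seek instead of reading the whole file.

```python
import json
import struct


def write_snapshot(path, records):
    """Hypothetical layout: [8-byte index offset][record payloads][JSON index].

    `records` maps string keys to JSON-serializable values.
    """
    index = {}
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", 0))  # placeholder, patched below
        for key, value in records.items():
            payload = json.dumps(value).encode("utf-8")
            index[key] = [f.tell(), len(payload)]  # byte offset and length
            f.write(payload)
        index_offset = f.tell()
        f.write(json.dumps(index).encode("utf-8"))
        f.seek(0)
        f.write(struct.pack("<Q", index_offset))  # point the header at the index


def read_index(path):
    """Load only the index, not the record payloads."""
    with open(path, "rb") as f:
        (index_offset,) = struct.unpack("<Q", f.read(8))
        f.seek(index_offset)
        return json.loads(f.read().decode("utf-8"))


def read_record(path, index, key):
    """Fetch one record by seeking directly to its payload."""
    offset, length = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length).decode("utf-8"))
```

A reader would call `read_index` once per snapshot file and then `read_record` for each lookup, so the memory cost is the size of the index rather than the size of the snapshot.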

From January to November, we have about 335 days of data. So far we’ve analyzed about 1.3 billion tweets at a stable rate of about 2000 tweets per second, without any serious attempt at multithreading. In the end, we expect to bring about 1.5TB of pre-analyzed data to NIPS, which you can then explore at the demo.
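The 1.5TB estimate follows from the snapshot numbers above, and the processing rate gives a rough idea of the compute time involved. Here is the back-of-the-envelope calculation in Python, assuming one 1.5GB snapshot per 8 data hours and decimal units (1TB = 1000GB); the variable names are just for this example.

```python
DAYS = 335                    # roughly January through November
SNAPSHOT_INTERVAL_HOURS = 8   # one snapshot per 8 "data hours"
SNAPSHOT_SIZE_GB = 1.5        # approximate size of each snapshot file

snapshots = DAYS * 24 // SNAPSHOT_INTERVAL_HOURS   # three snapshots per data day
total_tb = snapshots * SNAPSHOT_SIZE_GB / 1000     # assuming 1TB = 1000GB

TWEETS_ANALYZED = 1.3e9
ANALYSIS_RATE = 2000          # tweets per second during reanalysis
STREAM_RATE = 50              # approximate cap of the live stream

processing_days = TWEETS_ANALYZED / ANALYSIS_RATE / 86400
speedup = ANALYSIS_RATE / STREAM_RATE  # reanalysis speed relative to the live stream cap

print(f"{snapshots} snapshots, about {total_tb:.2f}TB")
print(f"about {processing_days:.1f} days of processing, {speedup:.0f}x the stream cap")
```

In other words, replaying close to a year of tweets takes on the order of a week of pure processing time at this rate.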

Development blog for TWIMPACT
beta.twimpact.com

Members
Mikio Braun
Leo Jugel
