Posts tagged "twitter"
We’ve launched a new demo based on our retweet analysis of 2011. The interface is similar to Google Trends and lets you search for and compare keywords.

Click on the above picture to go to trends.twimpact.com.

The data is based on the 300,000 most active retweets for each day, extracted from the public Twitter feed, which amounts to about 4.3 million tweets per day.

For more information, have a look at this blog post.

Here is a nice demo Leo put together. It shows a timelapse video of the places people are talking about on Twitter during March 2011; shown is the average activity over the last hour. On March 11, there was the huge earthquake in Japan, which dwarfs all other locations for quite some time.

For this demo, we’ve extracted place names for about 16,000 cities from OpenStreetMap data (about 500k name variations all in all) and then matched these names in the tweets (i.e. we’re not using geolocation but get the locations from the tweet texts themselves). The resulting stream is run through our analysis database to compute the location trends online.
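
To give a rough idea of what this kind of dictionary-based matching looks like, here is a minimal sketch (class and names are made up for illustration; the real matcher handles far more name variations and is built for throughput):

    import java.util.*;

    // Illustrative sketch of dictionary-based place-name matching on tweet text.
    // A production matcher would use something like Aho-Corasick; this only shows the idea.
    public class PlaceMatcher {
        private final Set<String> placeNames = new HashSet<String>(); // lowercased place names
        private int maxWords = 1;                                     // longest name, in words

        public PlaceMatcher(Collection<String> names) {
            for (String n : names) {
                String norm = n.toLowerCase(Locale.ROOT).trim();
                placeNames.add(norm);
                maxWords = Math.max(maxWords, norm.split("\\s+").length);
            }
        }

        // Returns all place names mentioned in the tweet text.
        public List<String> match(String tweet) {
            String[] words = tweet.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
            List<String> hits = new ArrayList<String>();
            for (int i = 0; i < words.length; i++) {
                StringBuilder ngram = new StringBuilder();
                for (int j = i; j < Math.min(i + maxWords, words.length); j++) {
                    if (j > i) ngram.append(' ');
                    ngram.append(words[j]);
                    if (placeNames.contains(ngram.toString())) hits.add(ngram.toString());
                }
            }
            return hits;
        }

        public static void main(String[] args) {
            PlaceMatcher m = new PlaceMatcher(Arrays.asList("Tokyo", "New York", "Sendai"));
            System.out.println(m.match("Massive earthquake near Sendai and Tokyo right now"));
            // prints [sendai, tokyo]
        }
    }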

We’re currently putting together a real-time version of this for our website.

Some numbers on our NIPS demo

So here is a bit of background information on the data processing we’re doing for our NIPS demo. We’re currently reanalyzing the retweet trends for all of 2011. We cannot afford the firehose (but really, who can?), but the normal stream API gives more than enough data. It seems to be capped at about 50 tweets per second, but this still amounts to about 4.3 million tweets per day. The sampling seems to be quite reliable as well, meaning that we get a pretty representative sample which captures all the important trends.

For the analysis, we’re keeping a “hot” set of the 300,000 most active retweets in memory. From that we also compute trends for user mentions, hashtags, links, and our TWIMPACT impact score. We also keep graph data of which user has retweeted whom and which user has retweeted which tweet. We’re bounding the number of edges in those graphs as well by continuously discarding old links, resulting in about 550,000 edges in the user-retweets-user graph, and about 15 million edges in the user-retweets-tweet graph.
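
To make the bounding a bit more concrete, here is a toy sketch of such a capped in-memory structure (illustrative only; our actual data structure selects the most active retweets rather than the most recently updated ones): an access-ordered map that drops the least recently touched entry once the cap is exceeded.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Toy sketch of a bounded "hot set": retweet counts per tweet id, with the least
    // recently updated entries discarded once the cap is reached.
    public class HotSet {
        private final Map<String, Integer> counts;

        public HotSet(final int capacity) {
            // access-ordered LinkedHashMap: iteration order is least recently used first
            counts = new LinkedHashMap<String, Integer>(capacity, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                    return size() > capacity; // evict once we are over the cap
                }
            };
        }

        // Record one more retweet of the given tweet.
        public void hit(String tweetId) {
            Integer c = counts.get(tweetId);
            counts.put(tweetId, c == null ? 1 : c + 1);
        }

        public Integer count(String tweetId) {
            return counts.get(tweetId);
        }

        public static void main(String[] args) {
            HotSet hot = new HotSet(300000); // roughly the bound mentioned above
            hot.hit("tweet-123");
            hot.hit("tweet-123");
            System.out.println(hot.count("tweet-123")); // prints 2
        }
    }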

All this data can be kept in about 6-8GB of memory. We write snapshots of the data to disk every 8 “data hours” in a custom format which includes indices, allowing for quick access even without loading the snapshot into memory; every file is about 1.5GB in size.
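
The basic idea behind the index is simple enough to sketch (this is not our actual snapshot format, just the general mechanism): records are written length-prefixed to the file, and an index of byte offsets lets you seek straight to a single record instead of reading the whole snapshot.

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch (not the actual snapshot format): records are stored as
    // length-prefixed UTF-8 blobs, and an index maps record ids to file offsets,
    // so a single record can be read back with one seek.
    public class SnapshotSketch {
        private final Map<String, Long> index = new HashMap<String, Long>();

        public void write(File file, Map<String, String> records) throws IOException {
            RandomAccessFile out = new RandomAccessFile(file, "rw");
            try {
                for (Map.Entry<String, String> e : records.entrySet()) {
                    index.put(e.getKey(), out.getFilePointer()); // remember where the record starts
                    byte[] data = e.getValue().getBytes("UTF-8");
                    out.writeInt(data.length);
                    out.write(data);
                }
            } finally {
                out.close();
            }
        }

        public String read(File file, String id) throws IOException {
            Long offset = index.get(id);
            if (offset == null) return null;
            RandomAccessFile in = new RandomAccessFile(file, "r");
            try {
                in.seek(offset);                 // jump straight to the record
                byte[] data = new byte[in.readInt()];
                in.readFully(data);
                return new String(data, "UTF-8");
            } finally {
                in.close();
            }
        }
    }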

From January to November, that is about 335 days. So far, we’ve analyzed about 1.3 billion tweets at a stable rate of about 2,000 tweets per second, without any serious attempt at multithreading. In the end, we expect to bring about 1.5TB of pre-analyzed data to NIPS, which you can then explore at the demo.

Some insights from hunting for memory leaks

One of the main design decisions with our current approach to analyzing retweet activity on Twitter data is to keep all the “hot” data in memory while simultaneously bounding the amount of data we are willing to keep. This makes sense as only a tiny fraction of tweets are retweeted more than once at all, and you somehow have to bound the amount of “live” data to ensure that your performance is stable.

Now, while re-analyzing the data for this year’s NIPS demo, we observed that memory was gradually filling up after a few weeks’ worth of analyzed data. So we went in to have a closer look.

The first thing we saw was that we had a lot of the original JSON strings in memory, even though we often only refer to a small substring (say, the name of a user somewhere in a tweet). It turns out that the main reason for this is that Java tries to be clever with substrings (and also with regex matches) and implements them as a restricted view of the original string, without copying the data. This is fine in terms of speed most of the time, but a problem when you’re extracting only small bits of the data and actually want to discard the rest once you’re done. Luckily, the solution is simple: call “new String()” on the substring, which actually copies the data.
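
In code, the fix looks roughly like this (the JSON snippet and helper are just for illustration; in a real tweet the extracted field is buried in a few kilobytes of JSON):

    public class SubstringCopy {
        // Hypothetical helper: pull a small field out of a large JSON string.
        static String extractField(String json, int start, int end) {
            // json.substring(start, end) alone returns a view that still references the
            // full JSON char[] (on the JVMs we use), keeping the whole string alive.
            // new String(...) copies just the characters we need, so the big string
            // can be garbage collected once we are done with it.
            return new String(json.substring(start, end));
        }

        public static void main(String[] args) {
            String json = "{\"user\":{\"screen_name\":\"twimpact\"}}";
            int start = json.indexOf("\"screen_name\":\"") + "\"screen_name\":\"".length();
            int end = json.indexOf('"', start);
            System.out.println(extractField(json, start, end)); // prints twimpact
        }
    }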

This alone reduced the pressure on memory significantly.

Eventually we figured out that there was one data structure whose growth was not explicitly bounded: the graph of which users have retweeted which tweet. The graph was implicitly bounded in that retweets would eventually be removed once they had become old enough, but since some retweets have been retweeted more than one hundred thousand times (and I have no idea what it means since it’s in Indonesian), there were more than twenty million edges in that graph.

So finally we came up with a strategy which continuously ages out edges in the graph, bounding its overall growth to fifteen million edges. Now, finally, everything is running as stably as we want it to, with about 10-15GB of live data.
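
The aging strategy itself is easy to sketch (again an illustration, not our production code): keep edges in insertion order and drop the oldest ones once the cap is reached.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative sketch of edge aging: edges are kept in insertion order, and once
    // the graph exceeds its cap, the oldest edges are discarded first.
    public class AgingEdgeSet {
        private final int maxEdges;
        private final Deque<String> byAge = new ArrayDeque<String>(); // oldest edge at the head
        private final Set<String> edges = new HashSet<String>();

        public AgingEdgeSet(int maxEdges) {
            this.maxEdges = maxEdges;
        }

        // Record that 'user' has retweeted 'tweetId'.
        public void addEdge(String user, String tweetId) {
            String edge = user + "->" + tweetId;
            if (edges.add(edge)) {
                byAge.addLast(edge);
                while (edges.size() > maxEdges) {       // age out the oldest edges
                    edges.remove(byAge.removeFirst());
                }
            }
        }

        public int size() {
            return edges.size();
        }

        public static void main(String[] args) {
            AgingEdgeSet graph = new AgingEdgeSet(15000000); // the bound mentioned above
            graph.addEdge("alice", "tweet-42");
            System.out.println(graph.size()); // prints 1
        }
    }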

Three weeks to go for our NIPS demo

We’re preparing a reanalysis of all of our data from 2011 to bring to Granada. The reanalysis works in two phases: first, retweets are analyzed sequentially for the whole year. This cannot be parallelized well, as you need to know what has happened so far to match retweets correctly (we’re also matching retweets which are not generated by Twitter but by people using the “RT” convention). In a second sweep, we will post-analyze the data to compute trends for links, hashtags, etc.
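
For the “RT” convention, the detection step is simple enough to show with a regular expression (a simplified illustration; the real matching handles more variations and, more importantly, has to find the original tweet among those seen so far, which is what forces the sequential pass):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Rough sketch of detecting "manual" retweets that follow the RT convention.
    public class RtMatcher {
        // e.g. "RT @twimpact: Real-time trends are live" -> user "twimpact", remaining text
        private static final Pattern RT = Pattern.compile("(?i)\\bRT\\s+@(\\w+):?\\s*(.*)");

        public static void main(String[] args) {
            Matcher m = RT.matcher("RT @twimpact: Real-time trends are live");
            if (m.find()) {
                System.out.println("original author: " + m.group(1));
                System.out.println("retweeted text:  " + m.group(2));
            }
        }
    }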

Current status: we’re about halfway through 2011 with the pre-analysis and have prepared the post-analysis.

Real-time seems to be the next big thing in big data. MapReduce has shown how to perform big analyses on huge data sets in parallel, and the next challenge seems to be finding a similar kind of approach for real-time processing.

When you look around the web, there are two major approaches out there which try to build something that can scale to deal with Twitter-firehose-scale amounts of data. One starts with a MapReduce framework like Hadoop and somehow finagles real-time, or at least streaming, capabilities onto it. The other starts with some event-driven “streaming” computing architecture and makes it scale on a cluster.

These are interesting and very cool projects; however, from our own experience with retweet analysis at TWIMPACT, I get the feeling that both approaches fall short of providing a definitive answer.

In short: One does not simply scale into real-time.

Read the whole post on Mikio’s blog

Yesterday’s Virginia earthquake in the hashtag cloud

Yesterday, there was a minor earthquake in Virginia at about 1:51pm local time. Within minutes, the earthquake became the dominant topic on Twitter. In the following, we track the development of this topic based on a real-time analysis of hashtag activity on Twitter. The size of a node represents the hashtag’s activity in retweets; links are drawn between hashtags that occur in the same tweet. The hundred most active hashtags are visualized.
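
Conceptually, the underlying co-occurrence graph is accumulated from the tweet stream roughly like this (a bare-bones sketch for illustration; the actual demo is driven by our real-time retweet analysis):

    import java.util.*;

    // Illustrative sketch: count hashtag activity and co-occurrence within single tweets,
    // which is what the node sizes and links in the visualization represent.
    public class HashtagGraph {
        private final Map<String, Integer> activity = new HashMap<String, Integer>();
        private final Map<String, Integer> cooccurrence = new HashMap<String, Integer>();

        public void addTweet(String text) {
            // collect all hashtags in this tweet
            List<String> tags = new ArrayList<String>();
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.startsWith("#") && token.length() > 1) {
                    tags.add(token.substring(1).replaceAll("[^\\w]", ""));
                }
            }
            for (String tag : tags) {
                bump(activity, tag);                              // node size ~ hashtag activity
            }
            for (int i = 0; i < tags.size(); i++) {
                for (int j = i + 1; j < tags.size(); j++) {       // link hashtags in the same tweet
                    bump(cooccurrence, tags.get(i) + "|" + tags.get(j));
                }
            }
        }

        private static void bump(Map<String, Integer> map, String key) {
            Integer c = map.get(key);
            map.put(key, c == null ? 1 : c + 1);
        }

        public static void main(String[] args) {
            HashtagGraph g = new HashtagGraph();
            g.addTweet("Did you feel that? #earthquake #virginia");
            g.addTweet("Shaking in DC #earthquake");
            System.out.println(g.activity);      // e.g. {earthquake=2, virginia=1}
            System.out.println(g.cooccurrence);  // e.g. {earthquake|virginia=1}
        }
    }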

At about 1:45pm, we see the normal activity on Twitter: a big teamfollowback cluster, as well as the usual suspects, damnitstrue and so on. There’s also a smaller cluster which reflects the recent events in Libya.

At 2:03pm, the earthquake cluster begins to pop up, still at about the same activity level as the other major topics. Note how it’s linked to the other topics through generic hashtags like cnn or fb (standing for Facebook).

By 2:30pm, earthquake is totally dominating the other topics (note how their activity has been scaled down in comparison). Also note how that cluster is linked to the Libya and teamfollowback clusters through generic nodes like socialmedia or twitter.

The Inevitable Google+ Post

Google has really pulled it off this time. Paul Allen has estimated that Google+ had already surpassed ten million users by July 10. Google has played the “closed beta” game very well, letting in only a small number of people who nevertheless started to flood the internet with posts about Google+, comparing it to Facebook and Twitter, evaluating it, sometimes dismissing it, more often being quite enthusiastic.

So, how is Google+? It’s a lot like Facebook, but it also feels somewhat unfinished. +1s (Google’s version of Facebook’s “like”) don’t show up on my page yet. The video group chat Hangout is actually quite cool and fun: you see everyone in a thumbnail view at the bottom of your chat screen, and you can click on individual members to enlarge them. You have the usual mechanisms: you can post text, photos, videos, and links, reshare posts, comment, +1, etc. The topic streams (called “Sparks”) seem to be quite unfinished, too.

For me, the biggest surprise is that there is no real-time search feature; there is only a search box for people (admittedly an important feature for building your network), and Sparks, where it is still a bit unclear what it does exactly. I see Google+ as sitting somewhere between Facebook and Twitter, with the structural richness of Facebook but the more open “public is default” policy of Twitter, and for this, real-time search is an indispensable feature for discovery. I don’t use this feature often on Twitter, but when I do, it’s always immensely useful. The first time I realized this was when I was wondering whether Debian Lenny had been released. I searched for “debian lenny” and directly got all the tweets where people were reporting that they were upgrading, so I knew. Other times, you can ride on the hashtag associated with some event to get real-time updates. These are both things which should be possible with Google+ as well, but currently, they aren’t.

In any case, we should not forget that Google+ is in beta. Sparks will hopefully improve over time, and Google real-time search will surely make a comeback, this time fed by the Google+ updates. So far, Google has done impressively well (and given their lack of success with Buzz and Wave, surprisingly well) with their new product.

^MB

Twitter acquires BackType

Twitter recently acquired BackType, a social media analysis company. BackType provided all kinds of analysis and metrics to help companies analyze their performance on Twitter. With this acquisition, Twitter is continuing to incorporate services and know-how which have so far been provided by third parties. Previous examples were Summize for real-time search back in 2008 and TweetDeck for its Twitter client, and now BackType.

BackType has been working on a number of open-source projects around real-time and Hadoop-based analysis, including ElephantDB, a database for exporting key-value data from Hadoop; Storm, a distributed real-time computation framework; and Cascalog, a Clojure-based query language for Hadoop.

Existing accounts at BackType will continue to work, but BackType has already announced that it will stop accepting new accounts.

From Twitter’s point of view, I think this strategy makes perfect sense: first you provide rather open access to your data so that startups can build interesting new products; then you pick the most interesting projects and incorporate them into Twitter. Basically, a lot of the development of new ideas is financed by the startup industry and comes for free for Twitter, except for the winner, of course.

What has me worried is Twitter’s tendency in the past to openly discourage people from continuing to work in a field once they have acquired the relevant technology themselves, as has already happened with Twitter clients. Is the same going to happen with trending and social analysis tools next?

Twitter has always attracted people building new products and services because access to the data was relatively easy. However, this might change in the future if Twitter establishes a track record of closing down areas once it has what it thinks it needs, and people might become interested in more open platforms.

^MB

Google suspends real-time search

A lot of things happened this last week. Google opened their new service Google+ to a small set of people (don’t bother trying to get an invite now, they seem to have closed down registration for the time being), and also changed the layout of the search page and the calendar.

As a “side effect”, Google real-time search is apparently gone. Mashable confirms this in an article and offers some insight. Apparently, the agreement between Twitter and Google has ended, and Google has not extended it for now, planning instead to focus on its own service, Google+. Google states that it is still crawling the publicly accessible pages from Twitter, which is of course hardly the same as having access to the Twitter stream.

I’m actually not that surprised, as Google’s offering always felt like a half-hearted attempt at real-time search. I think the challenges of real-time search are quite different from those of web search.

  • Web pages are relatively static, both in content and relevancy. In real-time search, however, the information changes very rapidly and has a high probability of becoming obsolete quickly. This leads to quite different technological challenges, so it’s probably hard to fit real-time search into Google’s existing infrastructure. The amount of cacheable information is also very limited.

  • Real-time search requires real-time relevancy measures. It doesn’t really make sense to just show the most recent messages matching your query. For popular events, there might be thousands of hits, swamping any really relevant hit after a few minutes or even seconds.

  • Displaying a list of all hits doesn’t make sense. Often, you have many near-identical hits, and some form of aggregation would really be useful. Google is doing something like this for news, but news lives on a much slower timescale than real-time search.

Naturally, these are also topics we’re very much interested in at Twimpact. For example, our retweet-based trending and user Twimpact score is a good starting point for getting a better estimate of relevancy (this is currently demoed at our Japanese trending site). We’re also moving to an infrastructure which does most of the analysis in memory to deal with the real-time requirements. This allows us to process literally thousands of messages in real time with relatively modest hardware requirements. You can get a glimpse at beta.twimpact.com.
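
To give a flavor of what a real-time relevancy measure can look like, here is a toy example of a time-decayed activity score (illustrative only; the actual Twimpact score is computed differently): every retweet adds to the score, and the score decays with a configurable half-life, so stale topics drop out of the ranking by themselves.

    // Toy illustration of a time-decayed activity score (not the actual Twimpact score):
    // every retweet adds 1, and the accumulated score decays exponentially with a
    // configurable half-life, so old activity stops dominating the ranking.
    public class DecayedScore {
        private final double halfLifeSeconds;
        private double score = 0.0;
        private long lastUpdate;
        private boolean initialized = false;

        public DecayedScore(double halfLifeSeconds) {
            this.halfLifeSeconds = halfLifeSeconds;
        }

        public void addRetweet(long timestampSeconds) {
            decayTo(timestampSeconds);
            score += 1.0;
        }

        public double scoreAt(long timestampSeconds) {
            decayTo(timestampSeconds);
            return score;
        }

        private void decayTo(long timestampSeconds) {
            if (initialized) {
                double dt = timestampSeconds - lastUpdate;
                score *= Math.pow(0.5, dt / halfLifeSeconds);  // exponential decay
            }
            lastUpdate = timestampSeconds;
            initialized = true;
        }

        public static void main(String[] args) {
            DecayedScore s = new DecayedScore(3600); // one-hour half-life
            s.addRetweet(0);
            s.addRetweet(60);
            System.out.println(s.scoreAt(3660));     // roughly 1.0 an hour later
        }
    }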

As far as I remember, Google had plans to incorporate, for example, a user’s social graph to refine the search results, but I don’t know how far that went. Let’s see whether they take the time to create something better. In the interim, a site like Topsy gives you a more comprehensive real-time search than Twitter’s own search.

^MB

Some more links on the later state of Google’s real-time search:
