Some insights from hunting for memory leaks
One of the main design decisions in our current approach to analyzing retweet activity on Twitter is to keep all the “hot” data in memory while simultaneously bounding the amount of data we are willing to keep. This makes sense because only a tiny fraction of tweets is ever retweeted more than once, and you have to bound the amount of “live” data somehow to keep performance stable.
Now, while re-analyzing the data for this year’s NIPS demo, we observed that memory was gradually filling up after a few weeks’ worth of analyzed data. So we went in to have a closer look.
The first thing we saw was that we had a lot of the original JSON strings in memory, even though we often refer only to a small substring of them (say, the name of a user somewhere in a tweet). It turns out that the main reason for this is that Java tries to be clever with substrings (and also with regex matches) and implements them as a restricted view of the original string, without copying the data. That is fine in terms of speed most of the time, but a problem when you extract only small bits of the data and actually want to discard the rest once you’re done. Luckily, the solution is simple: call “new String()” on the substrings, which actually copies the data.
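A minimal sketch of the effect (the JSON content here is a made-up stand-in): on JVMs of that era, substring() returned a view sharing the original string’s backing char array, so holding on to the small field kept the whole buffer alive, while new String() forces a copy.

```java
public class SubstringCopy {
    public static void main(String[] args) {
        // Stand-in for a large tweet JSON string (hypothetical content).
        String tweetJson = "{\"user\":{\"name\":\"alice\"},\"text\":\"...lots more data...\"}";
        // Extract a small field. On JVMs before 7u6, this returned a view
        // backed by tweetJson's char[], pinning the entire buffer in memory.
        String name = tweetJson.substring(17, 22);
        // new String() copies just the five characters, so the large buffer
        // can be garbage collected once the reference to tweetJson is dropped.
        String detached = new String(name);
        System.out.println(detached); // prints "alice"
    }
}
```

Note that newer JVMs (7u6 and later) changed substring() to copy by default, trading this retention hazard for O(n) substring cost.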
This alone reduced the pressure on memory significantly.
Eventually we figured out that there was one data structure whose growth was not explicitly bounded: the graph of users who have retweeted a tweet. The graph was implicitly bounded, since retweets are eventually removed once they become old enough, but because some tweets have been retweeted more than a hundred thousand times (and I have no idea what one of them means, since it’s in Indonesian), that graph had grown to more than twenty million edges.
So finally we came up with a strategy that continuously ages edges in the graph, bounding its overall growth to fifteen million edges, and now everything runs as stably as we want it to, with about 10–15GB of live data.
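One way such continuous aging could be implemented is insertion-order eviction: once the edge count exceeds the cap, the oldest edges are dropped as new ones arrive. This is a hypothetical sketch (the class and method names are not the actual twimpact code), using LinkedHashMap’s removeEldestEntry hook.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a retweet-edge store that continuously ages out
// the oldest edges once a hard cap on the edge count is reached.
public class BoundedEdgeStore {
    private final int maxEdges;
    private final LinkedHashMap<Long, long[]> edges; // edgeId -> {fromUser, toUser}

    public BoundedEdgeStore(final int maxEdges) {
        this.maxEdges = maxEdges;
        // accessOrder=false keeps insertion order, so the eldest entry
        // is always the oldest edge; removeEldestEntry evicts it whenever
        // an insert pushes the map past the cap.
        this.edges = new LinkedHashMap<Long, long[]>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, long[]> eldest) {
                return size() > maxEdges;
            }
        };
    }

    public void addEdge(long edgeId, long fromUser, long toUser) {
        edges.put(edgeId, new long[]{fromUser, toUser});
    }

    public int size() {
        return edges.size();
    }
}
```

With a cap of fifteen million, memory for the graph stays flat regardless of how often a single viral tweet is retweeted; the trade-off is that the very oldest edges of such tweets are forgotten first.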
- twimpact posted this