During our Bay Area trip two weeks ago, we had the chance to chat with Ben Lorica, Chief Data Scientist of O’Reilly Media, at the Ritual Roaster Coffee shop in the Mission in San Francisco. It turns out that what we did very much resonated with Ben, who had recently become interested in alternatives to scaling and in single-server systems. He was kind enough to write this great blog post about streamdrill.
Over the last few weeks, we’ve been working on extracting the real-time analysis engine behind TWIMPACT’s social media demos. The result is streamdrill, which we’ve just launched in a beta version.
Streamdrill is a real-time event processing engine that solves the top-k problem. You can pipe in several tens of thousands of events per second and instantaneously query the most active entries over the past minute, hour, day, or week.
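To make the top-k problem concrete, here is a minimal Python sketch of one well-known bounded-memory approach, the Space-Saving algorithm. This is an illustration only and is not streamdrill's implementation; the class name `SpaceSaving` and its `capacity` parameter are made up for this example, and the sketch ignores time windows entirely.

```python
class SpaceSaving:
    """Approximate top-k counter that tracks at most `capacity` keys (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def update(self, key):
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.capacity:
            self.counts[key] = 1
        else:
            # Table is full: evict the smallest entry and let the
            # newcomer inherit its count plus one (may overestimate).
            victim = min(self.counts, key=self.counts.get)
            self.counts[key] = self.counts.pop(victim) + 1

    def top(self, k):
        # Highest counts first.
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]


# Two heavy hitters followed by a tail of rarely seen keys.
events = ["a"] * 50 + ["b"] * 30 + ["c", "d", "e", "f"] * 5
tracker = SpaceSaving(capacity=3)
for event in events:
    tracker.update(event)

print(tracker.top(2))  # [('a', 50), ('b', 30)]
```

The linear scan for the minimum keeps the example short; a real engine would use a structure that finds the smallest entry in constant time.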
If you’re interested, we’ll spin up a small instance for you to play with.
We already have clients in Python and Scala available here.
Many of the tools like Hadoop or NoSQL databases are quite new and are still exploring concepts and ways to describe operations well. It’s not like the interface has been honed and polished for years to converge to a sweet spot. For example, secondary indices have been missing from Cassandra for quite some time. Likewise, whether a feature gets added is driven more by whether it’s technically feasible than by whether it makes sense. But this often means that you are forced to model your problems in ways which might be inflexible and not suited to the problem at hand. (Of course, this is not specific to Big Data. Implementing neural networks on a SQL database might be feasible, but is probably also not the most practical way to do it.)
While it was an interesting read, I’m not sure I really got it. My understanding is that the author’s advice is that, regardless of your backend storage or Big Data architecture, you should always think of your data and processing tools in terms of higher-level concepts such as data structures, operations on data structures, and processing algorithms.
At TWIMPACT, we’re big fans of stream mining algorithms for real-time event processing. One of their interesting features is that they let you trade exactness for computation time. However, people often ask us why that isn’t a problem. In this post, I collect four reasons why you don’t want your real-time big data analytics to be exact.
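As an illustration of that trade-off, here is a small Python sketch of a Count-Min sketch, a standard stream mining structure that uses fixed memory and a constant amount of work per event, at the cost of sometimes overestimating counts. It is not taken from TWIMPACT's code; the class name, the `width` and `depth` parameters, and the sample data are assumptions made for this example.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Fixed-size approximate counter; estimates never fall below the true count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One independent hash per row, derived by salting BLAKE2b with the row number.
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key):
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += 1

    def estimate(self, key):
        # Collisions only ever add to a cell, so the minimum is the tightest bound.
        return min(self.rows[row][self._index(row, key)] for row in range(self.depth))


# Made-up event stream: 50 ordinary users plus one very active key.
events = ["user%d" % (i % 50) for i in range(1000)] + ["hot"] * 200
exact = Counter(events)

sketch = CountMinSketch(width=32, depth=4)
for event in events:
    sketch.add(event)

print("exact:", exact["hot"], "estimate:", sketch.estimate("hot"))
```

The exact `Counter` grows with the number of distinct keys, while the sketch stays at `width * depth` cells no matter how many keys arrive. A wider table reduces the overestimation.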