One of the brilliant things about working closer to the technology, as I do at Full360, is that every once in a while you get to see things in a completely different light once you know you have a certain capability. Right now I do a lot more thinking about transactions per second and streaming real-time data, and I'm extending that thinking primarily through Amazon Kinesis and VoltDB. I'm only halfway through the exercise of building a prototype for a new demo I have in mind, but it's the thinking about applications that's most interesting to me now.
A couple of weeks ago I presented a webinar in which I discussed one of our realtime apps, which is attached to online gaming. On the back end of this architecture we process about 30,000 transactions per minute with a small two-shard cluster of VoltDB. This setup hardly makes a peep over 25% CPU utilization, running 24/7 with no downtime in two years. Cool enough, but then you realize it's handling more mobile transactions per year than PayPal. I did the math. And oh what fun it is to do this kind of math and realize what's computable for the kind of analytic data frameworks we build.
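The annualization is simple enough to show (the PayPal comparison is from my own reading; this is just the back-of-the-envelope arithmetic):

```python
# Annualize 30,000 transactions per minute.
tpm = 30_000
minutes_per_year = 60 * 24 * 365              # 525,600
per_year = tpm * minutes_per_year
print(f"{per_year:,} transactions/year")      # 15,768,000,000
```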
So what about telephone calls? Well, there are about 10M people in LA County, and according to some research I did, they make an average of 10-15 phone calls a day. That's 150M transactions per day, divided by 86400 (86400 being a new number I think about: the number of seconds in a day) gives me an average just under 1800 TPS. That's a decent enough stream, but really not difficult for Volt to handle, since I know Volt has a massive ingestion capacity and nice interfaces to stuff like Kafka, and Kinesis can be thought of as a flavor of Kafka. My idea here is to design a multi-tier database framework using Volt and Vertica that employs a lambda-style architecture for two levels of analysis. The example is a terrorist watchlist. I've picked this example because it's memorable and it offers a lot of interesting details along the way.
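The arithmetic behind that stream rate, spelled out:

```python
# LA County phone-call stream, averaged over a day.
population = 10_000_000
calls_per_person = 15            # upper end of the 10-15/day estimate
seconds_per_day = 24 * 60 * 60   # 86,400
calls_per_day = population * calls_per_person   # 150,000,000
avg_tps = calls_per_day / seconds_per_day
print(f"{avg_tps:.0f} TPS")      # 1736 -- "just under 1800"
```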
So first, a couple of clarifications.
Let's chunk fast data processing down into three classes: real-time, streaming, and micro-batching. I came upon this distinction in the process of building a fake data engine for my application. My engine is a sensor, and that sensor is something I have embedded in the telephone infrastructure. Using the Cisco Unified Communications Manager standard release 7.1 as the basis, I have identified the CDRs. Call detail records will tell me who is calling whom and how long they speak. In researching this, I discovered another way to think about 'big data': it turns out these CDR databases are flat files, and they can contain a max of 2 million records, or 6GB. But I'm not going to batch out those big suckers; I'm going to tail them and spit them out record by record.
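The tailing idea looks roughly like this. A minimal sketch: `tail_records` and its polling parameters are my own invention for the demo, not part of any CUCM tooling, and `max_idle_polls` is just a convenience so the loop can stop when the file goes quiet.

```python
import time

def tail_records(path, poll=0.01, max_idle_polls=None):
    """Yield lines appended to a CDR flat file one at a time, tail -f
    style, instead of batching out the whole 6GB file.
    max_idle_polls: optional demo convenience -- stop after that many
    consecutive empty reads rather than waiting forever."""
    idle = 0
    with open(path, "r") as f:
        while True:
            line = f.readline()
            if line:
                idle = 0
                yield line.rstrip("\n")
            else:
                idle += 1
                if max_idle_polls is not None and idle >= max_idle_polls:
                    return
                time.sleep(poll)    # ~10ms poll: "streaming" granularity
```

A production tail would seek to the end of the file first and follow log rotation; for generating demo traffic, replaying from the top is fine.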
A real-time system, I would generally say, is something purpose-built, i.e. there's a special OS kernel that does not interrupt the main job of the software. So basically embedded systems, right? Anything that's not software purpose-built for hardware that is itself purpose-built is streaming at best. That's the conceptual distinction in my mind, and I like it. Practically speaking, you can stream at 10ms increments, or even down to 2 or 3ms, which is pretty damned near real time, but it's still not a real-time system from our POV, because we're doing analytics, not onboard AI. Given that 25ms is considered a good standard for online gaming, we are basically looking at the limits of human perception. How fast is that? There are online tests that let you gauge your own perception anywhere between 1 and 320ms. So if I'm presenting data to a human being for a decision, nobody expects anything other than a hand-eye-coordination, twitch-style decision in under 100ms. So if you're a drone pilot, you're dealing with a streaming system. A batch system is one that runs on a periodic basis beyond a couple of minutes or so.
The system I created to generate phone records is a micro-batching producer into a Kinesis stream. I can now generate 50,000 records in about 1.7 seconds; that's clearly way more than I need for my peak TPS target of 5 x 1800 = 9000 TPS in the LA County phone-call scenario. The Kinesis API, however, accepts a max of 500 records per call, so that ingestion is kind of slow. Then again, I'm single-threading that push. Even so, batching up 16 pushes for the peak of 8,000, I get it done in 7 seconds. So the lag on the Kinesis side from my single producer isn't really a big deal, considering I'm generating the original 8,000 with the same single timestamp. I'm effectively catching about 1,100 TPS through Kinesis, which introduces a bit of input delay; then again, I'm only running 3 shards of Kinesis and I'm not sending parallel sets of data. I'm sure I could step that up, and in fact I know I could push records much faster directly to VoltDB if I wanted to. But I really wanted to demonstrate Kinesis and Volt together. Right now, I don't know the economics of the two.
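A minimal sketch of that micro-batch push using boto3 (my demo used the aws-sdk; the stream name, region, and record fields here are my assumptions). The 500-record chunking matches the PutRecords per-call cap:

```python
import json

def chunks(records, size=500):
    """Split records into PutRecords-sized batches (API max is 500)."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def push_batch(records, stream="cdr-stream"):
    """Single-threaded micro-batch producer into a Kinesis stream."""
    import boto3   # lazy import; sketch only
    kinesis = boto3.client("kinesis", region_name="us-east-1")
    for chunk in chunks(records):
        kinesis.put_records(
            StreamName=stream,
            Records=[
                {"Data": json.dumps(r).encode(),
                 "PartitionKey": str(r["calling_number"])}  # assumed field
                for r in chunk
            ],
        )
```

At the 8,000-record peak that works out to 16 PutRecords calls, which is exactly the batching described above.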
Additionally, I know that Kinesis will spit faster than it sucks. So the interesting performance will be seen in the next phase, which is the ingestion speed of VoltDB as a Kinesis consumer. I should have some results for you next week. The cool thing was that I basically got the Kinesis part running in under an hour; as soon as I figured out I was using an old version of the aws-sdk, all my problems went away and I was able to do what I needed. I suspect that in the long run I will likely be much happier doing the high-performance thing directly with Volt, and as I have legoed these components together I can already see some cracks in the idea.
Nevertheless, here's the scenario. I have a watchlist of 50 phone numbers. I start up my demo, which kicks off a container with my phone record generator. The generator cranks out half a million simulated phone calls. I ingest them into Kinesis; 8 minutes later, I've got half a million records stashed in Kinesis. Now I push the next button, and this represents a situation in which Volt is eating the Kinesis stream as if Kinesis were the originator of the streamed data. I don't quite understand the reason for the asymmetry of Kinesis' slow write and its fast read speed, but... In this phase, VoltDB is getting streamed transactions at pretty low latency. According to my stats, I'm averaging close to 60ms ingestion latency into Kinesis, while the output latency is more like <1ms on the scale they're showing me. So that is how I simulate a fast streaming dump. Obviously I could have Volt read the half million records from a flat file, but that's not really streaming, is it?
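For reference, here's roughly what the consuming side looks like if you drain the stream yourself with boto3 instead of through Volt's importer; the stream name and shard id are assumptions, and a single-shard drain like this is a sketch, not a production consumer:

```python
import json

def decode_record(data):
    """Kinesis hands back raw bytes; turn them back into a CDR dict."""
    return json.loads(data.decode())

def read_stream(stream="cdr-stream", shard_id="shardId-000000000000"):
    """Drain one Kinesis shard from the beginning, yielding decoded CDRs.
    Stands in for VoltDB's Kinesis consumer in the demo."""
    import boto3   # lazy import; sketch only
    kinesis = boto3.client("kinesis", region_name="us-east-1")
    it = kinesis.get_shard_iterator(
        StreamName=stream, ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    while it:
        resp = kinesis.get_records(ShardIterator=it, Limit=1000)
        for rec in resp["Records"]:
            yield decode_record(rec["Data"])
        if not resp["Records"] and resp.get("MillisBehindLatest", 0) == 0:
            break   # caught up; a live consumer would keep polling
        it = resp.get("NextShardIterator")
```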
So once the streaming records start hitting VoltDB, I will trigger events based upon my watchlist, and that will represent real-time flags on phone calls. Now, the glitch here is that the phone call is basically over by the time I have the CDR, but we'll ignore that for the moment; again, I'm just simulating a fast stream. For demo purposes, you can see that we will be able to immediately flag the phone calls on the watchlist. What we're also going to do is capture the source or target numbers of those calls and immediately add them to the watchlist. That's the second VoltDB task. All subsequent alert records are pushed downstream to Vertica, where we do some historical surveillance. How many calls did our subjects make? What about the second-order subjects? It's in the Vertica database that we add a bunch of metadata and do deeper analysis, but it's in the stream that we catch our perpetrators in the act. All this without actually checking the content of the phone calls.
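The two Volt tasks can be sketched as plain Python logic. In the real demo these would live in VoltDB stored procedures; the record fields and the seed numbers here are my assumptions, not the actual schema:

```python
# Task 1: flag any call that touches the watchlist (the alert goes
# downstream to Vertica). Task 2: add the other party to the watchlist,
# so second-order subjects are caught on their very next call.
watchlist = {"+13105550100", "+12135550177"}   # hypothetical seed numbers
alerts = []

def process_cdr(cdr):
    caller, callee = cdr["calling_number"], cdr["called_number"]
    if caller in watchlist or callee in watchlist:
        alerts.append(cdr)          # pushed downstream to Vertica
        watchlist.add(caller)       # second-order expansion
        watchlist.add(callee)

# A watchlisted number calls out, then the callee makes a call of its own:
process_cdr({"calling_number": "+13105550100", "called_number": "+18185550123"})
process_cdr({"calling_number": "+18185550123", "called_number": "+16265550144"})
print(len(alerts), len(watchlist))  # 2 4 -- both calls flagged, list grew
```

Note the second call is flagged even though neither party was on the original list of 50; that is the in-the-act catch the stream tier buys you.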
That's the state of the demo right now. I'm only using one kind of CDR out of the 50 or so that exist, so I'm not checking for conference calls or hangups. I'm also mixing up area codes at about a 0.47 ratio of inter- to extra-area-code calls, because right now the purpose is to identify cells in Los Angeles County. I'll probably throw in some international-style calls, and that could be a third rule for flagging an alert in Volt.