I noticed some folks reading through the Redshift category, and realized that I haven't written anything new for a while. So here's what's new:
We see that Redshift has improved its vacuum capabilities and added more functionality all around. Its performance has improved too, but its overall performance characteristics haven't changed. Redshift is not fundamentally different after two years. It still behaves like Redshift when compared to Vertica, the other MPP columnar database we support at Full360.
We have been able to learn quite a bit more about tuning Redshift, and Full360 will be offering this expertise as a service soon. It's called Upshift, and we believe it will be the most comprehensive performance evaluation available in the industry.
I have to qualify all of this by saying that I personally work a lot more with Vertica than I do with Redshift. These two products may seem very similar, but the details are often overwhelming. Fortunately we are developing a methodology that expresses the rules for optimization very well. So while the characterization I've made still holds true, there will be a growing number of exceptions and interesting circumstances. I call Vertica a magical sword: it is powerful, precise, and it sharpens itself. I can cut intricate and delicate patterns, and chop hundreds of large heads. I call Redshift an ogre's club: it is massively powerful, brain-dead simple to use, and relatively inexpensive. So you basically have to look at your application and know whether it is a job for a club or a sword. Our methodology will tell you exactly which, but like I said, the devil is in the details, and we are wrangling dozens of demons.
The good news is that both products are improving at a good clip. Still, I confess I'm paying more attention to Vertica. I'm really impressed with the overview I got yesterday on Vertica 8. They've optimized some of their geospatial algorithms. They've incorporated several ML features directly into the core product. They've dramatically improved their integration with Hadoop, Spark and Kafka. They're claiming 160% of Impala's performance. So that's superb. Most importantly, there is enthusiasm for the creation of the new company, which is less like Vertica getting sold to Micro Focus and more like a rebirth of Micro Focus itself, which, by the way, owns SUSE Linux. The Vertica guys are thrilled that they'll be working for a software company. That means the upcoming integration with S3 is serious, as is their priority on cloud implementations. All good.
A couple weeks ago, something profound occurred to me about enterprise software: when it is priced by server or by processor, it's a ripoff; when it is priced by usage, it's a bargain. Priced by the server, you are paying for the potential of using 8 cores even when you are not using them. Admittedly this is rather easy to see if you have experience working in the cloud and then take a turn doing on-premise practice, but it seemed rather profound to me. Imagine, if this is not obvious to you, paying a standard or premium license fee to your mobile phone carrier based upon whether you are using a brand new smartphone or an old one. Right. It makes no sense to pay for anything but the minutes you use on the phone. Mobile phone billing is done right. You pay for minutes. All software could be that way, so that you're not paying for servers, but for functions that spring to life, do your bidding, charge you a fractional penny and then die.
This week my boss sent me a link to a guy named swardley who likes ducks. It turns out that this character is quite capable of blowing my tiny mind. He's done it twice already, and so now I bear the burden of a speck of enlightenment, an enlightenment commensurate with my earlier graspings of n-tier computing (Wladawsky-Berger 1999) and horizontal scaling (Vogels 2008). So since I've been doing cloud for six years, I now understand what's happening next. God save us all.
Tying software development to economics and cost accounting has long been the stuff of magic, SWAG and charlatans. At least that's what it seemed like for most of my career. But I think Simon Wardley has the solution. He has outlined an ecosystem and a framework for understanding how one can iterate (captive) algorithms towards mutual value for the developers and the customers. He calls it FinDev.
Like most useful thinking on the progressive edge of IT, one must assume AWS. That is to say, very little that exists outside of AWS's ecosystem can be thought of as having great potential in the future of computing. AWS is beyond doing things very well; they are evolving at a monstrous pace and at enormous scale. As an aside, I asked some Agilists last night at El Torito why Amazon manages its businesses so well. One of them said it's because Amazon is a collection of relatively small businesses that work on a common billing system, and that's what keeps it simple to manage. They are not only eating, but profiting from, their own dogfood, which is the fully meticulous tracking of compute resource costing, now, with Lambda, down to the function. All the hardware is a sunk cost. Cloud computing is a utility. What matters now is billing by the function. Simply assume the cloud. It's already done.
So there is a scary aspect to this, which is something we all should have been afraid of all along. It is what happens to craft when things become industrialized. Your personal touch matters less in a market defined by optimization, cost-cutting and efficiency. Nobody cares how you show off the horse; we're all driving cars now. Nobody cares about your budget system; we're all using SAP now. Nobody cares about your fat client; we're all using browsers now. What's coming is a COTS revolution in which your college professor optimized the Towers of Hanoi solver and now owns the money ticker on the algo. In a global library of cloud-interoperable functions, the scope of what you get to work on gets narrower and narrower. The good news is that we are 20 years away from lockdown. The V8 of compute-engine economies has been invented. Say hello to the next 50 years. You Wankels don't stand a chance. When I was an undergrad, I used to think of software as the same thing as law. There are lots and lots of lawyers but only a few legislators. The assumption was that the best lawyers at some point got to legislate, and the rest just interpreted and borrowed citations for the benefit of those who never read the law. I believe there will be some measure of stare decisis in the new FinDev ecosystem.
So the future belongs to engineers who really know their customers' needs. The economy of FinDev provides value to developers and customers only to the extent that something can be built (at scale, in the cloud) that customers want to use. When you charge by use, that's a different business model than anything we've seen. Chances are it will be disruptive, because it will go after captive, inefficiently spent money. But there's greenfield out there too. More hopefully, there are new places computing can and will go once we wean ourselves from the economics of capacity planning, system depreciation, outsourced consulting and all that. I think AWS will be capable enough to handle global innovation in this regard; they're certainly leading. Now is the time to work our way towards best practices, evolving towards the revolution.
In Martin Cruz Smith's Arkady Renko series, the protagonist, Renko, informally adopts an orphan who is a chess genius. Playing at the genius level, the kid doesn't require a board or pieces. He, and those like him, can just recite moves. He has a virtual queen and doesn't even need hardware. If you're thinking about physically moving pieces, you're not playing chess.
At Full360, we have developed a best practice around our design of greenfield and re-engineered DW applications. The following is a high-level guide to how we accomplish this in Vertica. Vertica optimization is something we have pursued with vigor, and it can be pursued at several different levels. Implicit is the modularization of the applications so that the major functions of our data management philosophy can be expressed discretely. But let's get to the $10 words, shall we?
Idempotency. Both I and JDub could go on at absurd length about how important this is to the modularization of DW design. I will simply, with characteristic casualness, tell you that it makes all of our stuff idiot-proof, in that it makes our data provision dependencies kind of go away. The basic idea is that you make your input streams discrete in application units (which basically means the chunks at which the data to be consumed makes process and business sense). So when your input streams are discretely chunked, you can run your process over and over without concern about whether it has been done once, twice or never. You just run that independent data provision job and it creates the right-sized bucket of data.
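Here's a minimal sketch of what that looks like, assuming the vertica_python client and an invented orders_src table keyed by a batch_date increment; our real jobs are more elaborate, but the rerun-safety is the point.

```python
# A rerun-safe load of one discrete increment; all names are invented.
import vertica_python

CONN = {"host": "localhost", "port": 5433, "user": "dbadmin",
        "password": "", "database": "dw"}

def load_increment(batch_date, path):
    """Run it once, twice, or never: the end state is the same."""
    with vertica_python.connect(**CONN) as conn:
        cur = conn.cursor()
        # Throw away any prior attempt at this increment...
        cur.execute("DELETE FROM orders_src WHERE batch_date = :d",
                    {"d": batch_date})
        # ...then rebuild exactly the right-sized bucket of data.
        with open(path, "rb") as f:
            cur.copy("COPY orders_src FROM STDIN DELIMITER '|'", f)
        conn.commit()
```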
Set Transformation. VLDB folks are probably familiar with why you do ELT rather than ETL. The simple way of saying it is that database developers are more stingy and efficient with data than ETL developers. I developed a taste for hand-crafted 'ETL' back in the days when Informatica was a baby, and having my Unix biases, I always loved moving files around. At the time, my focus was on Essbase, which had no ETL hooks, even though Arbor should have purchased an ETL company on the cheap. Interestingly tangential: Wall Street has never been very long on ETL companies. Anyway, I expect that Informatica and Talend will not like to hear that their days are numbered, but then neither did Carleton and DataStage, and they used to rule the world. The bottom line is that moving data from table to table using SQL is going to be, in certain databases, much faster and more human-readable than doing it in a third-party GUI tool. So we do set transformation, and even regex stuff, inside Vertica. One day we may even benchmark UDFs against external programs.
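For the flavor of it, here's a hedged sketch of a set transformation kept inside Vertica. The table and column names are invented, and REGEXP_SUBSTR stands in for the kind of regex work we'd otherwise farm out to a tool.

```python
# One set-based pass from staged to clean, pushed into the database.
# All names are hypothetical; REGEXP_SUBSTR is Vertica's regex extractor.
TRANSFORM_SQL = r"""
INSERT INTO calls_cln (call_id, caller, callee, duration_sec, area_code)
SELECT call_id,
       caller,
       callee,
       duration_sec,
       REGEXP_SUBSTR(caller, '^\d{3}')  -- the regex stays in-database
FROM   calls_stg;
"""
# cursor.execute(TRANSFORM_SQL) is the whole "ETL tool".
```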
Denormalized. Vertica, like Redshift, is not a transactional database. It is columnar, and it easily handles 600-, 800-, 1200-column tables. It was designed to. So there is no reason to do a lot of silly little joins on silly little tables to get juicy fat data. We make all of that part of the ingestion process, which gives us what we want. Think about it for a minute. Consider the volatility of lookup tables and dimensions as compared to the volatility of atomic facts and transactions, aggregated or otherwise. The facts will be bigger and more fluid. So why spend join energy on query exhibits over the long haul when you can easily have all the columns you want? You don't have to. There are no table scans from hell; that's what columnar solves. So we go big.
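To make that concrete, here's a sketch of paying the join cost once at ingestion, with invented fact and lookup names:

```python
# Fold the low-volatility lookups into the fat fact table up front,
# so queries never re-join them. All names are illustrative.
DENORM_SQL = """
INSERT INTO sales_wide
SELECT f.*,               -- the fluid, fast-growing facts
       c.customer_name,   -- lookups joined once, here, not at query time
       c.segment,
       p.product_name,
       p.category
FROM   sales_stg f
JOIN   customers c ON c.customer_id = f.customer_id
JOIN   products  p ON p.product_id  = f.product_id;
"""
```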
Production. I'm not going to talk about the guts of Production other than to mention it briefly here. Production is where some of the genius lives, and we have a bag of tricks that is ever-expanding as we deal with realtime, near-realtime, and other odd types of data sources. Yes, we do lambda with streams and lakes, but we do smart lambda, again with whatever tech makes sense. Right now we're playing with Kinesis and Kafka and our own custom Actor models, which we're sure will evolve over time. We're also looking at how to use Redis and other superfast KV stores. So I suspect we will grow many efficient tentacles as we Produce standardized data for ingestion into our columnar DW apps. Nuff said.
We ingest data into source tables for each schema as they come to us. No matter how many fields, large or small, we will take them in using a COPY from produced files. Whether in Vertica or Redshift, we standardize on UTF-8 with vertical-bar delimiters and a backslash escape. In some cases, if we've munged up variable-length stuff from our own custom regex routines, or from JSON, Avro or other semi-structured effluvia, we will have an additional pre-step using Vertica Flex Tables. We are coming up with best practices there too.
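For reference, the shape of the COPY we standardize on, sketched for Vertica with an invented file and table name; Redshift's COPY against S3 takes much the same form.

```python
# UTF-8 in, vertical-bar delimited, backslash escaped. The path, table
# name, and REJECTMAX threshold are all illustrative choices.
COPY_SQL = r"""
COPY web_events_src
FROM LOCAL '/data/produced/web_events_20161101.psv'
DELIMITER '|'
ESCAPE AS E'\\'
REJECTMAX 100;
"""
```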
These source tables should also retain the original field names of the produced data when possible. This assists in debugging with the original developers.
All of the data that is to be used in the application should then be mapped to a view. This is the staged data. Staged data should be of the increment of ingestion (discretely chunked in application consumption units). That is to say, your _SRC and _STG will carry the same number of records, although they are likely to carry a different number of fields once the ingestion is done.
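A minimal example of the pattern, with hypothetical Cisco-style field names on the _SRC side mapped to our own on the _STG side:

```python
# Same record count as the source increment; typed, renamed fields.
STG_VIEW_SQL = """
CREATE OR REPLACE VIEW calls_stg AS
SELECT call_id,
       callingPartyNumber::VARCHAR(32)        AS caller,
       originalCalledPartyNumber::VARCHAR(32) AS callee,
       duration::INT                          AS duration_sec,
       batch_date
FROM   calls_src;
"""
```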
The clean tables should be materialized tables with human-readable fields that are conceptually discrete. This is generally accomplished through a direct insert from one or more _STG views. Clean tables can be defined with blank fields for further rule application. Clean tables that contain all history for the query space should carry the term _HIST_CLN; otherwise one should assume that a clean table is the same size as an ingestion increment. Clean tables should be optimized for the scope of the transactions that take place in their creation. When you are looking at the clean tables, you have all of the data you need from all of the sources, presented in the way you as a developer think about the data. There should be very little ambiguity at this point; it's your best look at the details before they are aggregated and rationalized.
Lookup tables are straightforward. They should be optimized for their transaction capacity with clean tables.
In a complex model, clean tables can be made into Intermediate Fact tables. The intermediate fact tables are materialized tables with all of the necessary dimensions that support measures that can be made across the full query space. These may or may not be exhibited directly, but should be useful in a partial analysis of the particular measures they contain. An intermediate fact table is the place where window functions are applied, and where obscure field conditions are made into explicit attribute and status fields. It is important to know that an application may have dependencies on different measures that seem to be dimensionally equivalent but actually aren't. So by using IFTs we unburden ourselves of the very idea of a single massive star or snowflake that might have holes. Also, we can capture all of the attributes of a set of measures completely, without concerning ourselves with their weight in the final presentation layer.
So think about this. A clean look at the data will probably not have sensible status fields; it will have codes. There may be multiple ways to interpret a certain combination of fields. So whatever you need to support the full scope of consumable data, even if it means synthetically remastering transactions, you can do with IFTs beforehand. Building these is the real guts of the DW application, and it's where the fun of working with sharp analysts comes in, especially on data integration projects. I will apply all of the reasonable and semi-reasonable business rules here. This is also where I have enough detail to point machine learning at, because at this point the data is explicit and human-readable, and it will reveal the more interesting cases and outliers. And here it should be very rich, beyond human comprehension. So yeah, maybe you don't go full wide with these tables; maybe you only add 4 of the 50 psychographic customer tags you have against their ID, because, security.
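A taste of what an IFT build looks like, with invented names and code values; the window functions and the explicit status field are the point:

```python
# Window functions applied, obscure code combinations made explicit.
IFT_SQL = """
INSERT INTO calls_ift
SELECT caller,
       callee,
       duration_sec,
       SUM(duration_sec) OVER (PARTITION BY caller ORDER BY call_ts)
           AS cumulative_talk_sec,
       ROW_NUMBER() OVER (PARTITION BY caller, callee ORDER BY call_ts)
           AS nth_call_between_pair,
       CASE WHEN cause_code = 16 AND duration_sec = 0 THEN 'UNANSWERED'
            WHEN cause_code = 16                      THEN 'NORMAL'
            ELSE 'ABNORMAL'
       END AS call_status
FROM   calls_hist_cln;
"""
```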
Exhibit tables are materialized views in some cases, but generally they are plain views presentable to the end user. These should be optimized for query retrieval (in columnar land, that means projections or sort keys rather than indexes). Security access rules for user groups and the like are applied only to exhibit tables. It should be assumed that no processes other than dbadmin will have access to any tables or views but the exhibit tables.
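The access rule reduces to a few statements, sketched here with a hypothetical schema and role:

```python
# Nobody but dbadmin touches the base tables; user groups see exhibits only.
GRANT_SQL = [
    "REVOKE ALL ON SCHEMA dw FROM PUBLIC;",
    "GRANT USAGE ON SCHEMA dw TO analyst_role;",
    "GRANT SELECT ON dw.calls_exhibit TO analyst_role;",
]
```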
So there it is. This is rather the state of the art as I have internalized it in five years of cloud-based columnar data warehousing. Maybe I should write a book.
I just put together a quick, casual video covering five questions about trends in markets and customers that we're seeing. It's nice to see that our experience dovetails exactly with the research put out in Gartner's latest Magic Quadrant.
Here's a link to the webinar I did last week with VoltDB. Having Volt as part of our architecture has enabled us to think about a whole new class of applications. Right now, I would say that we're at the point where we're really ready to deal with massive IOT streams. It's just a matter of getting the right people together. These days I'm brainstorming this kind of stuff, excited as I am by looking at real-time events and figuring out what I need to drive those data dogies.
In this presentation I talk about three different apps that are part of our Fast Data portfolio. All of them are real customers using this technology in production as part of Full360's managed service offerings. We designed and built these apps. All of them run securely and reliably in AWS VPCs, and customers love them. They are exemplary of our multi-tier DW framework that we call elasticBI. Why Panigale? Because this is all about making very fast decisions in real time. A second late is too late.
One of the brilliant things about working closer to the technology, as I do at Full360, is that every once in a while you get to see things in a completely different light once you know you have a certain capability. Right now, I do a lot more thinking about transactions per second and streaming real-time data, and I'm extending this thinking primarily through Amazon Kinesis and VoltDB. I'm only halfway through building a prototype for a new demo I'm thinking about, but it's the thinking about applications that's most interesting to me now.
A couple weeks ago I presented a webinar in which I discussed one of our realtime apps, which is attached to online gaming. On the back end of this architecture we are processing about 30,000 transactions per minute with a small two-shard cluster of VoltDB. This setup hardly makes a peep over 25% CPU utilization, running 24/7 with no downtime in two years. Cool enough, but then you realize that it is handling more mobile transactions per year than PayPal. I did the math. And oh what fun it is to do this kind of math and realize what's computable for the kind of analytic data frameworks we build.
So what about telephone calls? Well, there are about 10M people in LA County, and according to some research I did, they make an average of 10-15 phone calls a day. That's up to 150M transactions per day; divided by 86,400 (a new number I think about: the number of seconds in a day), that gives me an average just under 1800 TPS. That's a decent enough stream, but really not difficult for Volt to handle, since I know Volt has a massive ingestion capacity and nice interfaces to stuff like Kafka, and Kinesis can be thought of as a flavor of Kafka. My idea here is to design a multi-tier database framework using Volt and Vertica that employs a lambda-style architecture for two levels of analysis. The example is a terrorist watchlist. I've picked this example because it's memorable and it offers a lot of interesting details along the way.
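The arithmetic, spelled out:

```python
# Back-of-the-envelope TPS for the LA County phone-call scenario.
people = 10_000_000        # LA County population, roughly
calls_per_day = 15         # top of the 10-15 range
seconds_per_day = 86_400
tps = people * calls_per_day / seconds_per_day
print(round(tps))          # ~1736, i.e. just under 1800 TPS
```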
So first a couple clarifications.
Let's chunk fast data processing into three classes: real-time, streaming, and micro-batching. I came upon this distinction in the process of building a fake data engine for my application. My engine simulates a sensor, and that sensor is something I have notionally embedded in the telephone infrastructure. Using the Cisco Unified Communications Manager standard release 7.1 as the basis, I identified the CDRs. Call detail records will tell me who is calling whom and for how long they speak. In researching this, I discovered another way to think about 'big data'. It turns out that these CDR databases are flatfiles, and they can contain a max of 2 million records or 6GB. But I'm not going to batch out those big suckers; I'm going to tail them and spit them out record by record.
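The tailing itself is nothing exotic. A minimal sketch, with the path and the downstream send() left as hypothetical stand-ins:

```python
import time

# Follow a growing CDR flatfile and emit one record at a time,
# rather than batching out the whole 6GB file.
def tail_cdrs(path, send):
    with open(path, "r") as f:
        f.seek(0, 2)                  # start at end of file, like tail -f
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.01)      # ~10ms poll: streaming, not real-time
                continue
            send(line.rstrip("\n"))   # spit records out record by record
```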
A real-time system is something I would generally say is purpose-built, i.e., there's a special OS kernel that does not interrupt the main job of the software. So basically embedded systems, right? Anything that's not software purpose-built for hardware that is also purpose-built is streaming at best. That's the conceptual distinction in my mind, and I like it. Practically speaking, you can stream at 10ms increments, or even down to 2 or 3ms, which is pretty damned near real time, but it's actually not a real-time system from our POV, because we're doing analytics, not onboard AI. Consider that 25ms is considered a good standard for online gaming; we are basically looking at the limits of human perception. How fast is that? Here's a good place for you to test what's fast in terms of your own perception, between 1 and 320ms. So if I'm presenting data to a human being for a decision, nobody's going to expect you to make anything other than a hand-eye-coordination, twitch-style decision in under 100ms. So if you're a drone pilot, you're dealing with a streaming system. A batch system is one that basically runs on a periodic basis, beyond every couple of minutes or so.
The system I created to generate phone records is a micro-batching producer into a Kinesis stream. Now I can generate 50,000 records in about 1.7 seconds; that's clearly way more than I need for my peak TPS target of 5 x 1800 = 9000 TPS in the LA County phone-call scenario. Also, the Kinesis API caps me at 500 records per call, so that ingestion is kind of slow. Then again, I'm single-threading that push. Even so, as I batch up 16 pushes for the peak of 8000, I get that done in 7 seconds. So the lag on the Kinesis side from my single producer isn't really a big deal, considering that I'm generating the original 8000 with the same single timestamp. So I'm effectively catching 1100 TPS through Kinesis, which is providing a bit of input delay; then again, I'm only running 3 shards of Kinesis and I'm not sending parallel sets of data. I'm sure I could step that up, and in fact I know that I could push records way faster directly to VoltDB if I wanted to. But I really wanted to demonstrate Kinesis and Volt together. Right now, I don't know the economics of the two.
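The producer reduces to something like this, assuming boto3 and an invented stream name; the chunking exists because PutRecords caps at 500 records per call:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")

def push_batch(records, stream="cdr-stream"):
    # Single-threaded micro-batching, 500 records per PutRecords call.
    for i in range(0, len(records), 500):
        chunk = records[i:i + 500]
        kinesis.put_records(
            StreamName=stream,
            Records=[{"Data": json.dumps(r).encode(),
                      "PartitionKey": r["caller"]}  # spreads across shards
                     for r in chunk],
        )
```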
Additionally, I know that Kinesis will spit faster than it sucks. So the interesting performance will be seen in the next phase, which is the ingestion speed of VoltDB as a Kinesis consumer. I should have some results for you next week. The cool thing was that I basically got the Kinesis part running in under an hour. As soon as I figured out that I was using an old version of the aws-sdk, all my problems went away and I was able to do what I needed. I suspect that in the long run I will be much happier doing the high-performance thing directly with Volt, and as I have legoed these components together I can already see some cracks in the idea.
Nevertheless, here's the scenario. I have a watchlist of 50 phone numbers. I start up my demo, which kicks off a container with my phone record generator. The generator cranks out half a million simulated phone calls, and I ingest them into Kinesis. Eight minutes later, I've got half a million records stashed in Kinesis. Now I push the next button, and this represents a situation in which Volt is eating the Kinesis stream as if it were the originator of the streamed data. I don't quite understand the reason for the asymmetry between Kinesis' slow writes and its fast reads, but in this phase, VoltDB is getting streamed transactions at a pretty low latency. According to my stats, I'm averaging close to 60ms ingestion latency into Kinesis; the output latency is more like under 1ms on the scale they're showing me. So that is how I simulate a fast streaming dump. Obviously I could get Volt to read the half million records from a flat file, but that's not really streaming, is it?
So once the streaming records start hitting VoltDB, I will trigger events based upon my watchlist, and that will represent real-time flags on phone calls. Now, the glitch here is that the phone call is basically over by the time I have the CDR, but we'll ignore that for the moment. Again, I'm just simulating a fast stream. But for demo purposes, you can see that we will be able to immediately flag the phone calls on the watchlist. What we're also going to do is map the source or target of flagged phone calls and immediately add them to the watchlist. That's the second VoltDB task. All subsequent alert records are pushed downstream to Vertica, where we do some historical surveillance. How many calls did our subjects make? What about the second-order subjects? It's in the Vertica database that we add a bunch of metadata and do deeper analysis, but it's in the stream that we catch our perpetrators in the act. All this without actually checking the content of the phone calls.
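In the demo, those two Volt-side rules amount to logic like the following, sketched in plain Python with an invented record shape and seed numbers:

```python
# Rule 1: flag any call touching the watchlist.
# Rule 2: add the other party, so second-order subjects are caught next.
watchlist = {"3105550142", "2135550199"}    # invented seed numbers

def check_call(cdr):
    """Return True if this call gets flagged and pushed down to Vertica."""
    parties = {cdr["caller"], cdr["callee"]}
    hit = bool(parties & watchlist)
    if hit:
        watchlist.update(parties)           # grow the watchlist in-stream
    return hit
```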
That's the state of the demo right now. I'm only using one kind of CDR out of the 50 or so that exist, so I'm not checking for conference calls or hangups. I'm also mixing up area codes at about a 0.47 ratio of inter- to extra-area-code calls, because right now the purpose is to identify cells in Los Angeles County. I'll probably throw in some international-style calls, and that could be a third rule for flagging an alert in Volt.
I'm browsing in an attempt to understand my marketing job a little bit better. Look what I found
So if you want to know the difference between real clouds and fake clouds, here is a sample. First, sadly, I got an email today from ODTUG and found the acronym EPCRS or something like that. So I went to the Oracle website and found this non-embeddable animation about what it is. So literally, they don't even show the product, and what they do show isn't even compatible with blogs. I mean, it comes up in a popup window on Oracle's own website and it's crap. It doesn't even show the animation in a decent window.
Then there's Domo, which is a client of ours. They have their own YouTube channel. They understand social media. Oracle looks like The Hartford Insurance Company by comparison. I have seen Domo tech and it's very cool. The backend is even more interesting than the front-end, which is formidable. These guys get it.
I just did something remarkably easy that reminds me about why Amazon is winning and will continue to win. I started up three different databases and then threw them away. For anyone who has worked in the Amazon ecosystem, this may seem like just an ordinary thing. Indeed it is, but let me take you back in time.
In 1985, I took my second summer internship at Xerox in El Segundo, CA. I reported to a guy named Jack Starkey, who was one of those starched-shirt engineers of the first order. I loved the guy. My assignment was to build a data dictionary for the parts and service database that Xerox used to keep track of the maintenance of their top-of-the-line laser printers. Xerox was considering migrating from an IBM VSAM hierarchical database to something called Focus, a newfangled relational database. My job was to ensure that I had all of the definitions correct. I asked Jack if I could use the new Xerox Star Workstation in order to complete my job, which was essentially all about documentation. He agreed.
My first internship was more interesting because it was more technical. I actually wrote a financial modeling program. But my boss was loathed and feared in that area, and a lot of people hoped everything he did would fail. That guy, whose name I actually cannot remember (Jim somebody), was a notorious pipe smoker back in those days when you could smoke in the office. He liked something called Amphora Green. My project did not fail, although there were some interesting twists. My job was specifically to make a realtime pricing model that salesmen could use to develop a quote for customers based upon the way they actually did their electronic printing. At the time, most computer printing came out of printers attached to mainframes, and the most popular one was the IBM 3800. But the Xerox printer had duplex and quadplex, meaning it could print on both sides of the same sheet of paper. My program would show the long-term economic benefits of using the Xerox tech, which often came down to power and supply costs. So I learned 'Total Cost of Ownership' at a fairly young age. The MBA intern with whom I was working had her HP calculator. My code was being held up because she was late in delivering a 'cost matrix' to me. I sat down with her finally, to discover that she had just been plugging numbers into a formula run on her HP 12C. The MBAs didn't sit with the engineers, you see, so getting this meeting took weeks. I had to explain to her that this CP/M based computer could actually do that kind of formula calculation. She was shocked that Xerox actually made a computer that could do the same things as an HP 12C. We all learned a little something. I recall later on that a cat named Burkhart, whom I seem to have never forgiven, got the chance to present the wonders of my completed program to Xerox folks in London, while I went back to school that September.
But that was the pace of things in the mid 80s. Another dude I vaguely recall had the radical attitude I might have adopted were I not so desperate to drive a BMW. He advised me to get all my hacking done before the implementation of ACF2. What we were dealing with was the gap between the time it took engineers to understand a technology, the time it took them to implement it, the time it took the business to adopt it, and the time an actual payoff could be seen. Then there was the time it took for the capacity of the business to exceed the design limits of the system in place, the time it took for that problem to rise to the level of necessity, and the time for a replacement process to begin.
In the 80s, all of that was glacial. Moore's Law kept us all expectant about the future, mostly in terms of how much of all of the business processes could be captured outside of the pocket calculators of MBAs, but the process of business adaptation as well as the process of re-sizing systems have continued to be slow all the way to the current day. Amazon has emerged to understand these kinds of problems very well because of how e-commerce begat DevOps. And yet the majority of businesses still use 'Enterprise' architecture in their systems implementations. 'Enterprise', for all intents and purposes, was the term used to convince IT buyers that UNIX based servers could handle all of the business once only owned by the sort of mainframe computers that spit data out to Xerox and IBM printers. And then came e-business which introduced the 'n-tier' solutions, multiple tiers required because no single vendor, not even IBM, could provide hardware, networking and software solutions for the new class of applications being envisioned and built.
I don't have a buzzword we could reliably depend upon for what this time of transition to cloud architecture will be called. 'Post-Enterprise' is all I will hazard. However, which processes can and will be improved is a lot clearer. So I hope to speak to you, my fellows still using systems designed with on-premise Enterprise-class architectures in mind. That's my aim for this year: to help you see what I see, and what my company, Full360, can do to help you realize some of the promises of computing that were made a long time ago and have been a long time coming.
So what I just did, single-handedly, today and yesterday, was build three different database servers: Oracle 11g, MySQL and Microsoft SQL Server. They were each 4-core, 15GB servers with at least 300GB of SSD. They took about 15 minutes to configure and secure, with automated backups in an alternate data center. I was able to connect to them directly, through a secure VPN I had previously set up, with my RazorSQL client using JDBC. Today, I'm shutting down the Oracle and MySQL services because my customer only has the MSSQL license. But it was actually fun playing with the alternatives. Fun!
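For the curious, the whole exercise is roughly this much boto3, with the identifiers and sizes as illustrative stand-ins for what I actually ran:

```python
import boto3

rds = boto3.client("rds", region_name="us-west-2")

# Three throwaway databases, one call each.
for ident, engine in [("scratch-oracle", "oracle-se1"),
                      ("scratch-mysql",  "mysql"),
                      ("scratch-mssql",  "sqlserver-se")]:
    rds.create_db_instance(
        DBInstanceIdentifier=ident,
        Engine=engine,
        DBInstanceClass="db.m4.xlarge",   # ~4 cores, 16GB
        AllocatedStorage=300,             # GB of SSD
        StorageType="gp2",
        MasterUsername="admin",
        MasterUserPassword="change-me-now",
        BackupRetentionPeriod=7,          # automated backups
    )

# Throwing one away is a single call:
# rds.delete_db_instance(DBInstanceIdentifier="scratch-oracle",
#                        SkipFinalSnapshot=True)
```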
What's going to happen next is that I will start using AWS Config to keep track of all of my compute and networking assets for my customers. I'll be able to tell, at a glance, which server is doing what in which stack in which VPC. I will be able to manage services to an even greater extent for this and my other customers.
Back in 2001, I had a dream about a company I might work for in the future. I called the company 3DB because as a fundamental competency I wanted people who understood object, relational and multidimensional database technologies. We could then build the kinds of systems I envisioned. Now I work at Full360, but the 3DB concept is reality. We use multiple database technologies in our data management framework called ElasticBI. I didn't envision them running in a cloud architecture, but that is even better. Stay tuned.