I noticed some folks reading through the Redshift category and realized that I haven't written anything new for a while. So here's what's new:
Redshift has improved its vacuum capabilities and added functionality all around. Its performance has improved across the board, but its overall performance characteristics haven't changed. Redshift is not fundamentally different after two years. It still behaves like Redshift when compared to Vertica, the other MPP columnar database we support at Full360.
We have been able to learn quite a bit more about tuning Redshift. Full360 will be offering this service soon. It's called Upshift and it is surely the most comprehensive performance evaluation available in the industry.
I have to qualify all of this by saying that I personally work a lot more with Vertica than I do with Redshift. These two products may seem very similar, but the details are often overwhelming. Fortunately, we are developing a methodology that expresses the rules for optimization very well. So while the characterization I've made still holds true, there will be a growing number of exceptions and interesting circumstances. I call Vertica a magical sword: it is powerful, precise, and it sharpens itself. I can cut intricate and delicate patterns with it, and chop off hundreds of large heads. I call Redshift an ogre's club: it is massively powerful, brain-dead simple to use, and relatively inexpensive. So you basically have to look at your application and know whether it is a job for a club or a sword. Our methodology will tell you exactly which, but like I said, the devil is in the details, and we are wrangling dozens of demons.
The good news is that both products are improving at a good clip. Still, I confess I'm paying more attention to Vertica. I'm really impressed with the overview I got yesterday on Vertica 8. They've optimized some of their geospatial algorithms. They've incorporated several ML features directly into the core product. They've dramatically improved their integration with Hadoop, Spark and Kafka. They're claiming 160% of Impala's performance. So that's superb. Most importantly, there is enthusiasm for the creation of the new company, which is less like Vertica getting sold to Microfocus and more like a rebirth of Microfocus itself, which, by the way, owns Suse Linux. The Vertica guys are thrilled that they'll be working for a software company. That means the upcoming integration with S3 is serious, as is their priority on cloud implementations. All good.
So we have a bunch of big ideas around here at Full360 and it seems that I have been doing them rather than talking about them. That's what I like about work. Here's what we're doing now.
If you're going to handle big data, sooner or later you're going to get massive amounts that really are too bothersome to keep on your database server in any other form than IN the database. The cheapest place in the world to keep your data is at the NSA; they'll scrape it and store it for you for free, indefinitely, if you encrypt it. But the cheapest accessible place to keep your data is Amazon S3. So our theory of DW says: keep your permanent record on S3, and build dynamic data warehouses as needed. What? Dynamic data warehouses? Yes, we shall explain, but we're not there just yet.
The massive amounts of data on permanent record generally come in a stream or in multiple streams. In fact, we have abstracted essentially all of our dataflows to streams and have built standard operations around those streams. There is some precise language that I have around here, but the purpose of this blog post is just to give an overview. Suffice it to say that there are timing and chunking issues between the way data is produced at the source and the way it is consumed into the data warehouse. Hence: producers and consumers.
A Producer is a chunk of code that aims itself at the source of a datastream, periodically scrapes off a chunk, and then stores a {compressed, encrypted, UTF-8'd, versioned} file set onto a time-hierarchically organized S3 bucket. Let's say that your stream was POS transactions from McDonalds. Every hour, you would make a smart query of the transactions based on a time window (which may or may not be redundant) and then send your results to S3. At the end of the day, you'll have 24 flat files in the path mickeyd-bucket-name/POS/2014/2014-01/2014-01-09_0100_MickeyDPOS.txt or some such. Easy to get, can handle millions of files, no problem. But the point of the producer is that it is synchronized to the actual production of data, whatever the source, and chunks it appropriately for the most efficient manner of database ingestion.
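To make that concrete, here is a minimal sketch of what an hourly producer could look like in Ruby. The bucket name, the stream name, and the fetch_rows helper are hypothetical stand-ins, and a real producer would also handle encryption and versioning of the file set.

    # producer.rb -- a minimal producer sketch (hypothetical names throughout).
    # Every hour, scrape one time window from the source and land it in S3 under a
    # time-hierarchical key: <stream>/<year>/<year>-<month>/<timestamp>_<stream>.txt.gz
    require 'aws-sdk-s3'   # assumes the aws-sdk-s3 gem; adjust for whichever SDK you run
    require 'zlib'

    BUCKET = 'mickeyd-bucket-name'   # hypothetical bucket from the example above
    STREAM = 'POS'

    # Stub: in real life this queries the POS source for one hour of transactions.
    def fetch_rows(window_start, window_end)
      []   # each row is an array of field values
    end

    def produce(window_start)
      window_end = window_start + 3600
      rows = fetch_rows(window_start, window_end)

      # Pipe-delimited flat file, UTF-8, gzip-compressed before it leaves the box.
      body = rows.map { |r| r.map { |f| f.to_s.encode('UTF-8') }.join('|') }.join("\n")
      gz   = Zlib.gzip(body)

      # Time-hierarchical key so a day's worth of files lands side by side.
      key = format('%s/%s/%s/%s_%s.txt.gz',
                   STREAM,
                   window_start.strftime('%Y'),
                   window_start.strftime('%Y-%m'),
                   window_start.strftime('%Y-%m-%d_%H00'),
                   "MickeyD#{STREAM}")

      # Credentials and region come from the environment in this sketch.
      Aws::S3::Client.new.put_object(bucket: BUCKET, key: key, body: gz)
    end

    produce(Time.now.utc - 3600)   # run hourly from cron or a scheduler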
Producers have smart sub-features. For example, it is the responsibility of the producer to translate XML or JSON to CSV. We like vertical bars (since we are a Vertica shop, nyuk nyuk) but we call it a text file or a flat file anyway. Basically, all of the cleansing regex, bogus character detection, and internal record consistency stuff takes place in the producer. The producer can be made smarter with regard to time (to avoid redundancy) by tying its query to the DW control table. Most importantly, the producer process itself can be scaled horizontally for extra capacity and/or redundancy.
The producers create a text-based data warehouse in S3 which is rationalized to the largest set of data any data warehouse can handle: all history, all fields, all streams. The consumers determine what goes live and becomes queryable, and at what rate.
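As a rough illustration, the JSON-to-flat-file step with some basic cleansing might boil down to something like this; the field list and the exact cleansing rules are placeholders for whatever a given stream actually requires.

    require 'json'

    FIELDS = %w[store_id txn_id txn_time amount]   # hypothetical field list

    # Drop invalid bytes, control characters, and stray vertical bars so the
    # delimiter stays unambiguous.
    def cleanse(value)
      value.to_s.scrub('').gsub(/[[:cntrl:]]|\|/, ' ').strip
    end

    # One JSON document per line in, one pipe-delimited record per line out.
    def json_to_flat(json_lines)
      json_lines.map do |line|
        record = JSON.parse(line)
        FIELDS.map { |f| cleanse(record[f]) }.join('|')
      end
    end

    puts json_to_flat(['{"store_id":42,"txn_id":"a1","txn_time":"2014-01-09T01:05:00Z","amount":"3.99"}'])
    # => 42|a1|2014-01-09T01:05:00Z|3.99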
The Consumer is a chunk of code which generates smart, efficient updates and merges to the data warehouse in sync with the way that data is consumed by the DW users. So it is the more conventional driver of database operations. In fact, you can think of the actual database production lifecycle in terms of the consumer. As my good old habits dictate, each of these is a discrete operation named with a verb (a sketch of the pipeline follows the list):
Capture
Ingest
Cleanse
Stage
Merge
Calc
Errorfy
Report
Distribute
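In code, a consumer amounts to an orchestrator that runs those verbs in order against one registered chunk of files. Here is a sketch with every verb stubbed out; in real life each one issues SQL against the DW, and the class and method names are my own shorthand, not production code.

    # consumer.rb -- the verbs as a pipeline (all stubs, for illustration only).
    class Consumer
      VERBS = %i[capture ingest cleanse stage merge calc errorfy report distribute]

      def initialize(chunk)
        @chunk = chunk   # the set of S3 files registered in the control table for this run
      end

      def run
        VERBS.each do |verb|
          puts "#{Time.now.utc} #{verb} #{@chunk}"
          send(verb)
        end
      end

      # Each verb is a discrete operation; the real versions issue SQL against the DW.
      def capture;    end   # pull the list of unconsumed files from the control table
      def ingest;     end   # bulk copy each file from S3 into a source table
      def cleanse;    end   # enforce types, reject malformed records
      def stage;      end   # shape source rows into the staging schema
      def merge;      end   # upsert staged rows into the target tables
      def calc;       end   # derived measures and aggregates
      def errorfy;    end   # tag and route rejected records
      def report;     end   # refresh downstream reporting objects
      def distribute; end   # publish extracts to whoever needs them
    end

    Consumer.new('POS/2014/2014-01').run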
Errorfication is something we've done before but it is a concept that needs a bit of explaining. I'll do that in another post. The rest are basically self-explanatory. Notice that from Ingestion through Merging, we increase the complexity of the operations but reduce the volume of records we ultimately work on. Since we're using columnar shardable database tech, this scales rather well, even though merging will always be a CS nightmare.
Note that the producers and consumers work together in real time, but you can throttle them up or down. This is determined by the commit stamps on your control table. So let me talk about that a little bit.
Each file processed into the S3 buckets is going to be registered into a control table. This control table lets us trace all of our records back to S3. I.e., it is a key into the DW that gives us the provenance of the ETL process. You might be producing on the daily but consuming over the weekend in a four-hour batch window. Nothing exists to the DW until it is registered, which is done immediately after a bulk copy of a file is loaded to the appropriate source table. Assuming the source table lives until the next step of ETL, the control table will know everything it needs to know about the content of the file loaded into the system. The control table's information is used to throttle the size of consumption chunks. I could go back two days, 30 days, or two years on my consumer depending upon how much I want to burden my merge process. I could measure those two days by transaction stamps in the datafiles themselves or by the days they were produced to S3. Or I could just rebuild the entire warehouse from scratch, or stream by stream.
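To make the registration step concrete, here is roughly what it could look like in Ruby against Redshift; the pg gem works because Redshift speaks the Postgres wire protocol. The etl_control table and its columns are invented for illustration, so treat this as a sketch rather than our actual schema.

    require 'pg'   # Redshift speaks the Postgres wire protocol, so the pg gem connects fine

    # Hypothetical control table: one row per file loaded, giving the DW full
    # provenance back to S3.
    #   etl_control(s3_key, stream, source_table, row_count, produced_at, committed_at)

    def register_file(conn, s3_key, stream, source_table, row_count, produced_at)
      conn.exec_params(<<~SQL, [s3_key, stream, source_table, row_count, produced_at])
        INSERT INTO etl_control
          (s3_key, stream, source_table, row_count, produced_at, committed_at)
        VALUES ($1, $2, $3, $4, $5, GETDATE())
      SQL
    end

    # A consumer picks its chunk by querying the same table, e.g. "everything
    # produced in the last two days that hasn't been merged yet". That query is
    # exactly where the throttling happens.
    conn = PG.connect(host: 'my-cluster.example.com', port: 5439, dbname: 'dw',
                      user: 'etl', password: ENV['DW_PASSWORD'])
    register_file(conn, 'POS/2014/2014-01/2014-01-09_0100_MickeyDPOS.txt',
                  'POS', 'src_pos_transactions', 123_456, '2014-01-09 01:00:00')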
These are the fundamentals of the dynamic data warehouse. Its great efficiency is that database servers are not burdened with local data, since transfers from S3 into the VPC are cheap. It allows near-realtime data loads from disparate external sources, sometimes halfway around the world, and it allows for the exciting possibility of adding open data into corporate DWs with relative ease. Naturally, dumped DWs can be published elsewhere. It's very cool, sez me.
I've got to say it has been really weird looking at the empty space that is Cubegeek for the past 60 days or so. But one of my goals for the year is actually going to be convergence. I've finally put so many things in philosophical order in my life that I don't feel the necessity to separate all of my avocations from my vocations. Plus I worked out a good deal between my own skills, my intentions, and the good folks at Dreamhost. But there's a lot of ground to cover.
The huge news, of course, is Amazon Redshift, which throws a big 500-pound gorilla wrench into everybody's business model. A number of pundits have pooh-poohed the whole thing, but I have to tell you, this is a major part of the future, whether you like it or not. Redshift is Moore's Law for databases. It's impossible to ignore. Quite frankly, I'm not even sure how I can deal with the fact of its existence, because basically somebody with a reasonable amount of skill can put together a data warehouse, quickly. The upshot is that a lot of consulting can be done at home, the way I do it, and a lot of cheap, even throwaway, DWs can be built. This has scary implications for the quality of said DWs, and nobody knows exactly what sections of the market Redshift will come to dominate, but I can tell you this: our friends in the database world are defecating building materials.
I have been working with Redshift for several weeks now and its strengths are many. Primarily, I'm focused on its elasticity and its price. Additionally, I like that I can script everything at the API level. I haven't done all that yet, but I know that I can. It is lacking some nice developer tools at the Toad level, and if I were one of the guys at Panic Software, I'd make that my next project. As much as I love the Bootstrap web interface that Amazon has got running, nothing beats a finely honed fat client. Anyway, the biggest strength of Redshift right now is its ability to load data from S3, and we're thinking up some techniques and product designs that are going to take advantage of that. So check back with me in six months and ask about Project Kleiglight. In the meantime, we are learning by doing in Redshift.
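For the curious, the S3 load itself is just a COPY statement issued over an ordinary connection. Something along these lines, with the bucket, table, and credential handling as placeholders; check the COPY documentation for the options your cluster actually needs.

    require 'pg'

    conn = PG.connect(host: 'my-cluster.example.com', port: 5439, dbname: 'dw',
                      user: 'etl', password: ENV['DW_PASSWORD'])

    # Bulk load one produced file straight from S3 into a source table.
    # The CREDENTIALS clause is elided; supply whatever form of AWS
    # credentials your cluster is set up for.
    conn.exec(<<~SQL)
      COPY src_pos_transactions
      FROM 's3://mickeyd-bucket-name/POS/2014/2014-01/2014-01-09_0100_MickeyDPOS.txt'
      CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
      DELIMITER '|'
    SQL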
Here's my first opinion. Everybody who is using MySQL or MSSQL should migrate to Redshift as soon as they think they're ready for more performance. Period. Whatever market that is, I'll take it.
Here's my second opinion. Teradata is toast.
--
My Ruby-fu is up marginally. I picked up the Nokogiri gem and am now working a bit smarter with File. I've done some nice integration with standard Unix commands and also with loggers. So I would call myself competent with XML, YAML and JSON. I still haven't swung back to improve my Cucumber, but I am plenty comfortable with rspec. I'm working on a utility gem of my own for some text manipulation stuff that I do all the time. Next I'm going to play with the parallel gem to see how I can scale up certain ops.
I'm lagging on my seal book, the O'Reilly on Exploring with R, and I'm finally getting rid of the paranoia that sent me reeling two months ago. Nevertheless, I still read Darkside and attend a couple of security hacker meetups.
I've seriously upped my Chef game in the past couple of months. Working on our elasticPM code has gotten me fairly deep into the implementation end of orchestration. It is now clear to me that much of what we have been doing is utterly advanced; we've been on the edge of what Chef can do with Windows in many ways, and our unorthodox approach is what has been making Chef's learning curve more difficult than I expected. In addition, working more with Vagrant has improved my capabilities with virtualization. The next version of Vagrant is going to be awesome, I hear. So I've got about a dozen VMs here on my Mac. As they become migratable into AWS AMIs, it's going to be even better.
Speaking of which, I did get a chance to migrate an AMI across a region via the (two-month-old) way of moving core snapshots. Before I could write code to automate that (I've been busy), Amazon introduced a way to do it directly. So I haven't done it the newest way, but that's one more barrier to internationalization knocked over.
You have to realize that these days I consider myself to be something of an IT guy in the biggest IT shop on the planet, which is AWS. The new architecture is improving every month. More on this separately.
---
The reintegration project starts with me getting into a couple, three web architectures. I've gotten the static blog thing worked out with Jekyll and Octopress. So I'll probably migrate all this Cubegeek stuff under the single new site. But I really have to get this Node.js and Rails thing knocked out so I can speak that language to customers as well. You see, a lot of our business comes from people with low resistance to moving their assets to the cloud, since a lot of them have used colocation before. I'm going to try to head a lot of them off at the pass, since Amazon has DynamoDB, RDS, Redshift and Hadoop, three of which I've had my hands on. So a lot of the confusion I used to have over MongoDB, Cassandra, Riak, CouchDB and SOLR, I no longer have. I just ape the party line and say go Dynamo.
--
Vertica has been very, very good to me. So my take on this, splitting the difference, goes something like this. Redshift is for when you have invested *some* time into your DW and you want something low maintenance. Vertica is for when you need to tune the crap out of your system and you want near-realtime stuff. Basically, Vertica has all the bells and whistles for extreme computing. Redshift is more like MSSQL to Vertica's Oracle. Sorry, I hate analogies too, but that's about as close as I want to get to a hardball assessment in this post. I've played with a lot of databases in my time and I love Essbase and Vertica for the same reasons: their internals are beautiful and they enable an entirely new class of computing. However, I like Redshift for the same reason I like MSSQL, simplicity and elegance, except I know Redshift has a lot more upside than MSSQL.
I have worked with just about every major database technology going back to something called BCC out of Utah. Right now is the golden age, because today we have all the major technologies available in stacks that can be built on Amazon. It's a very exciting time to be a data architect.