Where do I begin to tell the story? How about we go from the epiphany backwards? This morning I read this from Dave Sisk.
Hadoop is not even in the same ballpark as any of the CDBMS's that have been mentioned. It's not even a database, for that matter...it's a giant MPP ETL process...which is a great thing IF you use it as a giant MPP ETL process instead of trying to use it as a database. I've examined HBase closely, and looked at a colleague's implementation...it has to be the only key/value store that I can think of that is a worse piece of crap than MongoDB. (At least MongoDB is better than something.) My colleague's company's 60-node implementation of HBase (on big honkin' enterprise hardware) struggles to insert 2000 rows of data per second (I can insert at 10-20 times that into PostgreSQL running in a VM on my friggin' laptop), and reports run for hours (sometimes days). You can do the same work in a good columnar RDBMS 2-3 orders of magnitude faster...as in milliseconds or seconds (minutes at worst), instead of hours or days.
At a prior company, we used Vertica to consume hundreds of thousands of rows per second, and could return results from billion row high-reduction queries in a few hundred milliseconds (from about 5 hefty nodes)...Hadoop/Hive/HBase with hundreds of nodes could not come within 2 orders of magnitude of that kind of performance, no matter how much hardware you throw at it.
That is what I've been waiting to hear for years. So I wrote back.
Finally! Thank you for helping me understand what I didn't: that there is a massive performance gap between Vertica and Hadoop. I've been wondering why I never meet Hadoop folks in my space (Business Intelligence), and I've been listening to the guys at MapR tell me theories of why their Amazon-approved version of EMR is going to be a world beater. I have assumed that all Hadoop is simply a very large scalable file system (and I've started to call it HDFS) plus some clunky tools that lack a semantic layer. But I assumed that a reasonably competent Java programmer (who wants to be that?) could make it perform at a good clip. No matter what the performance, I expected that it was mostly ETL-class tech. BUT I figured sooner or later somebody was going to build a SQL semantic layer on top of it, and then it would be serious competition for columnar DBs, primarily because of mind and market share.
However, if Hadoop is little more than a glorified giant file system, and map reduce is to HDFS as regex, grep, sed, awk and perl are to ordinary file systems, then there's no way it will ever compete with columnar tech on performance and cost efficiency.
I'm going to assume HDFS is a data lake from here on out, unsuitable for BI queries.
That assumption means a great deal in my world, and it erases a class of insecurity I've had in every discussion of 'big data' I've been a part of for several years. You see, none of my customers have Hadoop or ask for Hadoop. All of them can form the syllables. They know about Hadoop. Hadoop *is* big data, as far as the layman's world (and another world that is not BI) is concerned. So I haven't had a *reason* outside of great curiosity to build anything with Hadoop.
First of all, I have S3. There's no bigger data lake I've ever needed. Terabytes are not a problem. I mean, I've got terabytes at home. But under ElasticBI, I've got S3 whipped into shape and smartly integrated with every database I care about (Vertica, Redshift, Essbase, VoltDB). So I never have to concern myself with running out of space for staging or moving massive amounts of data. It's always about database optimization. That requires structure, structure requires purpose, and I know how to get that from my customers. Hadoop is about unstructured data.
Now when I start talking about unstructured data, I mean web data that has a volatile structure. And in that space I see tools like MongoDB, Cassandra and CouchDB, with Couch as the winner. I've heard horror stories about Mongo, and that Cassandra is a big tease that never quite gets all of her drama together. There's also Riak, which is the heavyweight champion in the space, so I hear and believe. But I'm not building ecommerce websites and I don't need to manage volatile content and serve up XML to be rendered, so I have no need for that class of data management. Not right now anyway. I want to be the master of all data management, data store and database worlds, but I have to deal with one continent at a time, and I can accept that Riak and Couch are on another planet. Planet Unstructure.
But when they say they're going to put on SQL clothing and take their big data + analytics into the Data Warehouse realm, that's war of the worlds. I don't like the prospects of that imminent invasion. Because, really, websites are the masters of handling 10,000 concurrent users. I can't do anything like that. Those guys must know something.
Well, so did LAMP stackers at one time. I'm going to take a gamble and commit HDFS and all that manipulation to the back porch. It's not competition in BI, and it never really was, but in my mind that was really a matter of market focus. Now I understand technically that there are hard limits to what all of that can ever do, and strong technical reasons why what I do with my database tech is not threatened by these other systems.
I still want to know. I still want to spend some hands-on time and eyeball some NoSQL tech in the context of its own systems. I wish I knew a guy, personally. But maybe our paths haven't crossed for good reasons. Either way, I am stepping out of the shadow of the elephant. Hadoop is just a data lake fallback for collecting a lot of stuff that may or may not be map reduced into something coherent. It's a specialty ETL transform that we at Full360 will do with realtime streams. So maybe if my Colorado River data pipeline fails, it will form a Hadoopified Salton Sea while we fix the levee. What it most definitely is not, is Fast Data. Meaning it has no place in the IoT future and is an artifact of what we will inevitably call something along the lines of JBOD. Lakes are meant to be drained. And now I've completely introduced heterogeneous metaphors into this. But that's kind of how epiphanies work, neh?
I'm out of the shadow of the elephant who drinks up lakes with that weird trunk of his. Yay circus tricks!