I am between projects, and I feel the pull of the imperative. When I have time and am not working, my mind reminds me of all those things I don't know. And so I need to know, which means that when I have a minute I prefer to learn rather than be entertained. At least I do when I have peace of mind, which I do right now.
But since I'm such a scatterbrain, I'm going to need to blog my way to sanity and keep track of all the 20-minute foci I am able to maintain when my impatience gets the better of me and I pop from window to window.
So I'm reading my Java and my Python, learning both simultaneously. I'd like to play with Ruby and may have to, but it feels like cheating and I've been cheating too long. I need to address the sophisticated fundamentals so that I can make choices more intelligently in the future. In other words, I'm invoking the Overkill Rule.
So I've got Eclipse set up on each of my machines. I'm making Vega into the big server - although I should probably have one more server. I'll have to figure out what's light enough to put on the smaller machines. I think at the very least, I'll have LDAP on Metis. I'm also finding it very easy to do installs on OS X, and it works the way I like, so there's a decent chance I get a Mac Mini for Christmas.
I am getting frustrated finding a decent JDBC driver set for free. I've gotten the connections to work across the two MySQL databases that I've set up using Toad for SQL, SQLYog and MySQL Workbench, but I'm not quite there with the Eclipse SQL plugin. Yes, of course it's overkill, which is, as I said, the point. What I really want is a good set so I can play with the basics and then wrap my Talend and PDI toolsets around them. But I got source-level packages, which meant I had to figure out the com.blah.this.thatDriver file hierarchy and stuff for Java. It should be easy in a couple weeks as I get to understand the differences between the way Eclipse would build stuff and how Git would (with Ant?). So the current task is getting a handle on Java builds, debugs, packages, classes and all that organizational rot. Annoying Frosh stuff.
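Just so I remember what 'working' means here, the target is nothing fancier than this; a minimal smoke test, assuming the stock Connector/J driver class and a throwaway local database (the URL, user and password are placeholders of mine, not my actual setup):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class JdbcSmokeTest {
        public static void main(String[] args) throws Exception {
            // Connector/J can register itself automatically, but loading the
            // driver class explicitly doesn't hurt on older setups.
            Class.forName("com.mysql.jdbc.Driver");

            // Hypothetical host, database and credentials -- substitute your own.
            String url = "jdbc:mysql://localhost:3306/test";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT VERSION()")) {
                while (rs.next()) {
                    System.out.println("Connected to MySQL " + rs.getString(1));
                }
            }
        }
    }

Once something that dumb runs from inside Eclipse with the driver JAR on the build path, the rest is plumbing.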
I'm pretty sure that I don't want NetBeans. I think I'll be happy enough with Tomcat, will ignore JBoss and Glassfish, and won't worry about the integration. I tend to go a bit old school anyway and got TextWrangler (instead of BBEdit) for the Mac, since I can run shebang scripts from it.
If you've heard me blather about my technical accomplishments for any length of time, then you've heard me brag about the Boeing CPS project. It was by a good measure the most ambitious project I have ever been involved in. The short end of the story was that I showed how I could get 500 databases to run concurrently in memory. I performed that minor miracle with the help of a guy who knew LoadRunner inside and out and a team of techs at HP. You see, this was done on a massive HP Superdome machine. I did this about four years ago at HP's Cupertino campus.
It turns out that HP has just sold off the property where such levels of supercomputing research, testing and debugging were done. This marks the end of an era, because the dudes I worked with up there had a bit of the shakes at the time. So if you ever heard me snark about Mark Hurd, it had something to do with the vibe they had, knowing at the time that things were not going as well as they should have been. I was very impressed by the staff for the three weeks that I was up there, but I could see that we were sorta alone. I mean, here was some of the coolest big iron available, when the enterprise IT world was still freshly adapting to 64-bit computing - the one thing that restored my flagging faith in Windows. I could work with 32 processors and 64GB of RAM. How sweet was that?
Well, now that we know some of Google's four-year-old magic, it's not very sweet. And if that campus was the last place HP engaged in massive computing with companies like Boeing, for doing stuff that required that level of compute power, then this sale to Apple represents another piece of inevitability for cloud computing.
I'm setting up my first cluster at home. I just got the O'Reilly book - Tom White's second edition - and I'm going to plow through it. I know, I keep saying all these things I'm going to do, and I never seem to have enough time to follow through, but now I don't feel so much like a fool alone in the wilderness as I did when I first looked at Nutch.
I'm looking back to when I first had some notion about this, which is when I joined Hackett - or maybe a couple years before. See, there were some guys at Tellme.com that I wrote about here. They used MapReduce to set up data to feed to Essbase. So I want to follow a similar path.
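I'm only guessing at the shape of that path, but the core of it would be a job like the following: roll raw transaction lines up to account-level totals that a cube load could then consume. This is just a sketch against the standard Hadoop mapreduce API; the field positions (account in column 0, amount in column 3) are made up for illustration:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AccountRollup {

        // Emits (account, amount) for each comma-delimited transaction line.
        public static class TxnMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (fields.length > 3) {
                    // Hypothetical layout: account in column 0, amount in column 3.
                    context.write(new Text(fields[0]),
                            new DoubleWritable(Double.parseDouble(fields[3])));
                }
            }
        }

        // Sums the amounts per account; the output becomes a load file for the cube.
        public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                    throws IOException, InterruptedException {
                double total = 0.0;
                for (DoubleWritable v : values) {
                    total += v.get();
                }
                context.write(key, new DoubleWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "account rollup");
            job.setJarByClass(AccountRollup.class);
            job.setMapperClass(TxnMapper.class);
            // Sums are associative, so the reducer doubles as a combiner.
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }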
It turns out that our old friends at Pentaho announced that PDI 4.0, aka Kettle, now has a version that talks Hadoop. It's a big deal. And while I'm keeping an eye on Karmasphere, I think that's a good bunch of news. So the other day I got Spoon up and running and played around a bit. It reminds me of exactly what a friend of mine told me about Talend: the community editions have UIs in name only; the enterprise editions and add-on tools are what make the damned thing productive. Plus they know the performance shortcuts. Fair enough I suppose; so long as it's working glue, I won't complain. Considering that MSSQL has a free version too, there have got to be at least a few things PDI can kick ass at. That's to be determined as I sneak another server into the house.
I'm going to build a little Hadoop cluster in my garage and attach a couple terabytes. It shouldn't be hard and I don't really care about performance right now. I just want to get a working environment in my grubbies and have, at the very least, a working knowledge of how to put the thing together. I wish I knew somebody in my neighborhood that was interested in this kinda stuff.
I'm sure there are marketing folks who can't wait, but watching this cringeworthy video I get a queasy feeling that this Christmas season is the beginning of the end of a certain romance geeks have with Apple.
We all know the popular darling Apple has become - the consumer behemoth that it might become, with all sorts of clever encroachments on computing. I don't know. Think about it. Ten years from now, when the majority of enterprise IT will be done in the Linux-based cloudscapes, where will Apple have gone? Gone to market, every one. When will they ever learn?
One of the problems of being brainy, as I was explaining over mai tais last Friday to my friend Larry, is that a lot of things stick in your head. Many, if not most, of these things are of little or no value. So the brainy person, being so gifted or cursed as the case may be, must come up with a system of drawing his own attention to those things that are actually valuable. I do this through my writing and blogging, and eternally sussing out my philosophy. It works more or less, but I can only say so because my wife doesn't demean what I do for a living. Odd benchmark perhaps, but it suffices. So in order to underscore the oddness of a sticky brain, I sang a Nestle's Quik jingle from the 70s.
When your stomach's got the thungries
And you don't know what to think
It doesn't want a cookie an apple or a drink
It wants that special something
Something sweet and rich and thick
Something smooth and cool and chocolaty
It wants a glass of Quik.
As far as Google knows, this is the only place on the planet to find such complete lyrics, somewhere deep out of my crazy sticky mind. And now you remember too, or if you don't, good for you.
Memory, like creativity, is associative and bursty. At least that's my theory. So with the keyword 'Thungries' you get all that. Sometimes you have the keyword and the rest of the memory remains buried. And it is only since I've had some time off these past few weeks that my mind has become my own again, and I don't have to exploit its stickiness for the various vicissitudes of worklife.
One day last winter when I was in the midst of thinking, an old problem asserted itself and I decided to think a bit about how to solve it. As I usually do, I gave it a couple keywords and then promptly forgot about the details. This happens a lot. I leave echoes of the solvable in my head and then kind of dismiss it because it's solvable. (remind me to repost my classes of problems). The keywords were 'Clumper' and 'Ragged Robin'. And now that I've had the glorious experience of restoring several hundred GB of my own data, I happened across some notes and now the ideas are back in full form, like the memory of how a song reminds you of an intersection you were driving through with the radio on, or vice-versa.
The problem, a bit more formally stated, is that fairly common MDM problem of harried accountants who can't decide if a certain expense or revenue should be made distinct by virtue of its account number, its department, or its entity. And you get combinations that violate dimensionality. In other words, it is the expression of will that many financial guys project into their systems: account 4000 means what I say it means because I say so - but yeah, I changed my mind two years ago. In yet other words, it's a taxonomy problem.
While I key on the word 'taxonomy', let me remind you of the best seminar I've ever heard on the taxonomy problem (the solution is tagging). (I'll find the link to the Long Now seminar later and place it here - jump to 20:00 and consider Dewey Decimal.)
Practically speaking, in your source data you have bad 'clumps'. You have associations of accounts, departments and entities that have become catch-alls. So the first part of solving the problem is identifying them and marking them behaviorally. That is to say, when you aggregate the detailed transaction data you might find some unusual number of bookings finding their way into the aggregations that you are using for your detailed level of analysis. My experience says this obviously happens when you have an 'Other' account, but not so obviously in marketing budgets, when people can't decide which media bucket to use, or there are new combinations of joint marketing programs. Use the general account for the media department, or the media buy account for all departments?
Clumper creates a simple histogram of the combinations. Duh. Brain dead simple. But it gives an insight as to where the aggs are coming from *below* the lowest level of detail in your ordinary analysis. And this can be used in addition to the business sense of the *amount* of money, activity or whatever metric accrues to those combinations. Maybe your inputters are gaming the system in ways that you cannot see in multidimensionality.
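To keep myself honest about how brain dead simple it is, here's roughly all Clumper has to do; a sketch assuming comma-delimited detail with account, department, entity and amount in the first four columns (my made-up layout, not a real extract):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class Clumper {
        public static void main(String[] args) throws Exception {
            // Count how many detail bookings land on each account/department/entity combination,
            // and how much money rides on each one.
            Map<String, Integer> counts = new HashMap<String, Integer>();
            Map<String, Double> amounts = new HashMap<String, Double>();

            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",");
                if (f.length < 4) continue;
                // Hypothetical layout: account, department, entity, amount.
                String clump = f[0] + "|" + f[1] + "|" + f[2];
                Integer c = counts.get(clump);
                counts.put(clump, c == null ? 1 : c + 1);
                double amt = Double.parseDouble(f[3]);
                Double a = amounts.get(clump);
                amounts.put(clump, a == null ? amt : a + amt);
            }
            in.close();

            // Sorted dump of the histogram: combination, booking count, total amount.
            for (Map.Entry<String, Integer> e : new TreeMap<String, Integer>(counts).entrySet()) {
                System.out.printf("%-40s %8d %14.2f%n", e.getKey(), e.getValue(), amounts.get(e.getKey()));
            }
        }
    }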
This becomes more apparent when you consider write-back. Which is to say which of those KPIs do you actually plan to? How many times do you have a situation in which various entities are allowed to book actuals into accounts that nobody plans to? What's the ratio? How do you decide rationally to make people plan at a more detailed level? Clumper will give you clues.
What about Ragged Robin? Well, its job is to give an interactive accounting of all the clumps found and allow a remapping. In other words, it's a data matrix. Once you have rationalized the good and bad clumps - those that will allow some smart multidimensional analysis - then you can re-run it with retagged metadata. Let me repeat that, because in n years, I've never had a customer ask me to do that.
Once you have rationalized the good and bad clumps, those that will allow some smart multidimensional analysis, then you re-run it with retagged metadata.
Get it?
OK, let me explain it one other way so I can remember again some time in my own future. Think of the time you had to register for classes in college. You stand in the queue A-F for your last name. Did you ever wonder if they ever make the signs actually representative of a frequency analysis of the actual registrants' last names, or do they use the same signs every semester?
Clumper does the histogram analysis. Ragged Robin remaps and fixes hierarchies according to actual usage.
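And stripped of the interactive part, the Ragged Robin half is just a lookup applied back over the same detail; a sketch under the same made-up layout, with the remap table supplied as old-combination to new-combination pairs:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class RaggedRobin {
        public static void main(String[] args) throws Exception {
            // args[0]: remap file of lines like "oldAcct|oldDept|oldEntity,newAcct|newDept|newEntity"
            // args[1]: detail file in the same hypothetical layout Clumper read.
            Map<String, String> remap = new HashMap<String, String>();
            BufferedReader mapFile = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = mapFile.readLine()) != null) {
                String[] pair = line.split(",");
                if (pair.length == 2) remap.put(pair[0], pair[1]);
            }
            mapFile.close();

            // Re-emit the detail with the rationalized combinations so the model can be re-run.
            BufferedReader in = new BufferedReader(new FileReader(args[1]));
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",");
                if (f.length < 4) continue;
                String clump = f[0] + "|" + f[1] + "|" + f[2];
                String target = remap.containsKey(clump) ? remap.get(clump) : clump;
                String[] t = target.split("\\|");   // assumes well-formed remap entries
                System.out.println(t[0] + "," + t[1] + "," + t[2] + "," + f[3]);
            }
            in.close();
        }
    }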
So here's a brilliant idea that's been around for a long time. Why not build a suite of specialized databases in an n-tier environment, each for one of the kinds of purposes that general-purpose RDBMSs get adapted to? I thought about that idea a long time ago when I was working with Heba B. over at i2 Technologies. It's an old story that I tell when I talk about the limits of multidimensional databases, so why don't I tell it now?
i2 was the leader in supply chain software, and I was assigned to work with those guys to enhance their product called Rhythm Reporter. So we had a starter project and proved that we could build a multidimensional reporting model on the Rhythm data. What we couldn't do at the time was make a CORBA-compliant data broker in the back end for tighter integration. Well, we could have done it, but there was no funding for such a project. All things were go for a looser integration; however, we had one weird catch.
When you are costing out a bill of materials in a manufacturing process, there are sometimes unit costs that apply to the top of a hierarchy. This was a problem for Essbase. Think of a table. There are four legs and a tabletop, five pieces that make up the major components. All of those would be children of 'Table', and their materials, unit and assembly costs could be aggregated. But sometimes there is a cost associated with 'Table' itself. Well, how do you get that in without adding a bogus child? I think this is a problem that MSAS solved later, but we didn't have a solution at the time, nor did we have the facility to write custom Java functions for the Calculation Engine. So the problem went without a solution and against us at the time. So it got me to thinking about how best to coordinate multi-database solutions when a single technology couldn't hack it. i2's core product didn't use a relational database but something odd instead, and its data model was twisted from its query language to populate the RDBMS. Heba and I worked on getting that relational data into OLAP. So I thought of the idea of '3DB', a company specializing in object/relational/multidimensional data problems and solutions. But since I don't have an MBA, I couldn't figure out a way to convince management to do anything with that. I mean, a company that wouldn't partner with i2 to build a CORBA interface isn't going to try any actually risky ideas; besides, there were only about seven people on the planet who understood what we were doing in any detail.
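The shape of the problem is easy enough to show outside of Essbase. Here's a throwaway sketch of what the calculation wants to be - a component carries its own cost on top of whatever rolls up from its children - which is exactly the step a pure parent-equals-sum-of-children aggregation can't express without that bogus child (the table numbers are invented, obviously):

    import java.util.ArrayList;
    import java.util.List;

    public class BomNode {
        private final String name;
        private final double ownCost;                   // cost attached to this node itself
        private final List<BomNode> children = new ArrayList<BomNode>();

        public BomNode(String name, double ownCost) {
            this.name = name;
            this.ownCost = ownCost;
        }

        public BomNode add(BomNode child) {
            children.add(child);
            return this;
        }

        // Rolled-up cost = this node's own cost plus the rolled-up cost of its children.
        public double totalCost() {
            double total = ownCost;
            for (BomNode c : children) {
                total += c.totalCost();
            }
            return total;
        }

        public static void main(String[] args) {
            BomNode table = new BomNode("Table", 12.50);   // e.g. an assembly cost on 'Table' itself
            table.add(new BomNode("Tabletop", 40.00));
            for (int i = 1; i <= 4; i++) {
                table.add(new BomNode("Leg " + i, 7.25));
            }
            System.out.println("Total cost of Table: " + table.totalCost());
        }
    }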
I don't know what happened to CORBA but the Java interface did get built. Still, the powers that be decided eventually that Essbase would be best suited, always and ever more, for a certain subset of financial apps. Stuff like supply chain was not in the offing - there was money to be made in ERP integration and that's the way it went. However, geeks like myself lamented the lack of other analytic applications that might be served. I can't complain really, it's been how many years since then, and I haven't made any money building much other than ERP integration apps.
But it only takes about seven people on the planet to think about such things theoretically and slowly roll out capable tech, and this is now how I think about the guys behind the Eigenbase project, which was started about five years ago. They get it. Different data stores for different kinds of data.
First stop: LucidDB. LucidDB is a column-based database. And wouldn't you know it, they describe it in a way so clear that I get it the first time.
In LucidDB, database tables are vertically partitioned and stored in a highly compressed form. Vertical partitioning means that each page on disk stores values from only one column rather than entire rows; as a result, compression algorithms are much more effective because they can operate on homogeneous value domains, often with only a few distinct values. For example, a column storing the state component of a US address only has 50 possible values, so each value can be stored using only 6 bits instead of the 2-byte character strings used in a traditional uncompressed representation.
Vertical partitioning also means that a query that only accesses a subset of the columns of the referenced tables can avoid reading the other columns entirely. The net effect of vertical partitioning is greatly improved performance due to reduced disk I/O and more effective caching (data compression allows a greater logical dataset size to fit into a given amount of physical memory). Compression also allows disk storage to be used more effectively (e.g. for maintaining more indexes).
The companion to column store is bitmap indexing, which has well-known advantages for data warehousing. LucidDB's bitmap index implementation takes advantage of column store features; for example, bitmaps are built directly off of the compressed row representation, and are themselves stored compressed, reducing load time significantly. And at query time, they can be rapidly intersected to identify the exact portion of the table which contributes to query results. All access paths support asynchronous I/O with intelligent prefetch for optimal use of disk bandwidth.
Nice. So I'm going to play with this puppy and then circle around to see if I can learn some more.
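Just to convince myself I follow the 6-bit claim: a dictionary-encoded column only needs enough bits per row to tell its distinct values apart, so 50 states fit in ceil(log2(50)) = 6 bits. A toy sketch of that arithmetic - not LucidDB's actual on-disk format, just the idea:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ColumnDictionary {
        public static void main(String[] args) {
            // A column of state codes, the way a row store would hold them as strings.
            String[] column = {"CA", "WA", "CA", "NY", "TX", "CA", "WA", "NY"};

            // Build a dictionary of distinct values; encode each row as a small integer code.
            Map<String, Integer> dict = new HashMap<String, Integer>();
            List<Integer> codes = new ArrayList<Integer>();
            for (String v : column) {
                Integer code = dict.get(v);
                if (code == null) {
                    code = dict.size();
                    dict.put(v, code);
                }
                codes.add(code);
            }

            // Bits needed per row: enough to distinguish the distinct values.
            // With all 50 states that's 6 bits instead of 2 bytes per row.
            int bitsPerRow = Math.max(1, 32 - Integer.numberOfLeadingZeros(dict.size() - 1));
            System.out.println("Distinct values: " + dict.size());
            System.out.println("Bits per row:    " + bitsPerRow);
            System.out.println("Encoded column:  " + codes);
        }
    }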
Now I've been pointed, by a thoughtful reader, to a project called Firewater. If I understand it correctly, it means that I can partition out what's partitionable in LucidDB over multiple nodes. The implication is that I could maintain a model on a cluster of blades and... well, I'd have to do some thinking about that. But here's my first guess: the licensing model for Essbase has made multi-server partitioning impractical. My second guess is that I would put my measures in columns and could initiate parallelism in my aggregations...
Does anybody know if transparent partitions in Essbase are shared nothing or are/were there communications bottlenecks that messed up performance? What were the practical limits to Essbase partitions? I'm sure I never did more than 6 in a single model. That was CCE back in Atlanta, wasn't it?
I've started to look at getting closer to technology and re-energizing my traditional track, which is data architecture. One of the things I have found, much to my dismay, is that some of the web scale guys have really taken off over the past four years and have beaten us enterprise DW & BI guys in terms of scalability.
So there are a lot of places I've been reading to put my head into that game in my spare time. There are names and blogs that I'm going to socialize into Cubegeek. A lot of this started with Cloud talk, and it's clear to me that that's just a bit too broad, and according to some folks I talk to, premature. That doesn't stop it from being very interesting. I've paid a lot of attention in lots of places. Curt Monash was my first stop.
Now it turns out that about two years ago, I made a call with my buddy out to West LA to speak with a guy named Jody Mulkey. Over at Shopzilla, they had some scalability problems with Essbase. So I talked to the guys there and they turned out to be pretty sharp. They seemed to have done all the reasonable stuff in tuning what Essbase they had, but were not up on the latest versions. So since the company I work with is an Oracle partner, we basically had to wait for the Oracle rep to get the proper paperwork signed so that they could get all of their enterprise licensing together. With any luck they'd call me back and then I could maybe hang out and try several things with some clustered ASO. After all, I did know a little bit about running multiple Essbase servers. Well, it turned out that the Oracle rep(s) involved weren't particularly interested in getting the paperwork done, and there were quarter-end considerations and all that kind of malarkey. Bottom line, we never got called back in.
It was a cool day because I did get to meet Mulkey, who seemed like a cool guy, and I liked the way they talked about systems there as well. It was an interesting day, the day just after Obama got elected. I had on my monkey suit and people were looking at me differently - but what I remember most was the bigscreen they had there where sampled random sales from Shopzilla referrals popped up, GIS style. The contrasts were splitting my head, because here was a serious IT shop, very casual in its communications, stymied by dumb bureaucracy, and I was the guy wearing the suit - but they had awoken my inner geek.
Since then, I started thinking about clouds and whatnot, but my company was not interested. I've also been following Greenplum, Vertica, AsterData and news about them. Way before that, if you go back in this blog, you'll see I was looking at Bigtable and such. I understand the strategy, but now it's time to get to nuts and bolts. Nuts and bolts means Java APIs and more primitive ROLAPery than I've been used to with Essbase.
Over the past couple weeks, I've been playing around with multiple installations and finding out how surprisingly many Linux apps in the open source world are also functional on the Mac. My general aim will be to get more deeply engaged in the open source tech & business, which I am starting to get a better feel for. What I hope is that by engaging at the Java API level, I'll pick up technology that is not entirely disposable. As well, of course, I'll be making sense of Hadoop clusters, and interfaces to all the data. In other words, I'm going back to being a hardcore data architect, harder than before. I expect to use a larger toolkit with more open pieces. Even though MSSQL Server is free, there's more stuff out there.
Yes, Virginia, there are interesting and smart people in Southern California. I just came back from my third Meetup, and I'm telling you that's a great tool. I met folks in the web business and they are doing some pretty interesting things. The flavor is definitely entrepreneurial and I've heard some exceptional stories. Maybe it's because they're new to me, but it's really refreshing to know that folks are making money themselves on the strength of new ideas and new technology.
William has got momentum now, and he had to change the venue from a smaller joint to Wokcano, which is rather chic even though it costs 4 bucks for a Red Bull. He hipped me to RIAK by Basho, and although I don't quite have my head around how I might get BI knowledge out of massive document stores, it is interesting to hear that people frustrated with Hadoop are happy with RIAK. I seem to recall an Xtranormal movie joke about Hadoop cluster failures, and that's the thing that RIAK doesn't do. Surely Cloudera will come up with a smarter way to do system monitoring, but the (now) legendary Foursquare failure is a lesson that the big data community is trying to learn from. Oh yeah, and Joyent. I remember those guys. I guess they're the hotshots - they should know a whole lot by now.
So I'm expecting that what I might like to learn is something about the fuzzy area between unstructured and structured data. Given my understanding of analytic consumption and EPM feedback loops, I'll add value. So this is where Derrick was telling me about Solr. It sounds like Solr is part of the missing link between structured and unstructured, or at the very least can assist in indexing massive data sets. While I've been warned away from Cassandra, it sounds like there's something Solr can do to make up for its shortcomings - all of which is very interesting. He also tells me that I should check out Splunk. Hey, wait a minute. I know that CEO. Well, whaddya know? He was keying in on the phrase 'machine generated data'. Yep. Nice. The other term he used was 'faceting'. That's a nice way to structure up stuff on a big data set if you don't know exactly what it might contain beforehand.
Let me think about that for a minute, adding to the minute I thought about what was interesting and weird about Qlikview. In Qlikview and in SAS, I seem to recall the ability to do what seemed to be a random kind of drilldown - not in the way that made sense from a multidimensional design standpoint. I'll call it something like faceted search, not knowing exactly what the proper definition is. But imagine that I index all of several fields in a data set, but don't organize them into hierarchical dimensions. I can still drill down a path to narrow the data set without predetermining dimensions, using the cardinality of the remaining index counts as my key - the cardinality being an interesting part of the dataset itself.
We commonly do this at BestBuy's website. You look at cameras, and then down the left banner you get counts of all the brands of cameras available. You also have certain attribute ranges on the prices, 0-100, 101-250, etc., and counts on those. At the bottom of your drill down are particular documents. Nice. This kind of navigation is for finding a single item or maybe a single class of items - it doesn't make sense to aggregate all of the data up to the highest level, so this could be a preliminary search into interesting data - something sitting on top of cubes to go - a nice way to winnow down a huge data set. Considering that free Splunk will parse and index 500MB/day, that's a good way to get started. Thanks, Derrick.
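Mechanically, those counts down the left banner are nothing more than group-bys over whatever documents survive the filters so far; here's a little sketch of the idea, with a made-up camera catalog standing in for the index:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class Facets {
        // A "document" is just a bag of field -> value pairs here.
        static Map<String, String> doc(String brand, String priceBand) {
            Map<String, String> d = new HashMap<String, String>();
            d.put("brand", brand);
            d.put("price", priceBand);
            return d;
        }

        // Count the distinct values of one field across the current result set.
        static Map<String, Integer> facet(List<Map<String, String>> docs, String field) {
            Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
            for (Map<String, String> d : docs) {
                String v = d.get(field);
                Integer c = counts.get(v);
                counts.put(v, c == null ? 1 : c + 1);
            }
            return counts;
        }

        public static void main(String[] args) {
            List<Map<String, String>> cameras = Arrays.asList(
                    doc("Canon", "101-250"), doc("Canon", "0-100"),
                    doc("Nikon", "101-250"), doc("Sony", "0-100"), doc("Sony", "101-250"));

            // No hierarchy was designed in advance; the counts themselves are the navigation.
            System.out.println("brand: " + facet(cameras, "brand"));
            System.out.println("price: " + facet(cameras, "price"));
        }
    }

Filter on a brand, recompute the counts over what's left, and you've got the drilldown without ever having committed to a dimension.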
Last year when I went to the CalTech seminar on clouds, I met a couple guys in the lobby who were rip-ready to roar with their hosting. It looks like cloud hosting is hot and heavy here in LA. Dennis is already considering another property. It has been several years since I signed on at Dreamhost, so there's evidently a new generation going.
I really enjoyed talking with Jad. He's got the right ideas for aggressively attacking important markets with targeted solutions. Smart pricing model too. He reminds me of Levi, whom I met last week over at McCabe's. One smart guy designing with a small team of engineers can take a big bite out of the market share of the larger software firms. Agility matters, especially when open source interoperability is so real. How real? That I don't know, but I do know that one size cannot fit all, and the software vendors who come to recognize that their products can be rightsized to their customers are going to win big. Think about that for a minute. Why pay for a legacy of features you don't use?
Shout outs to John from Boston and Ian. And to the guy from the VA, it's Charles Wyble.