So here's a brilliant idea that's been around for a long time: why not build a suite of specialized databases in an n-tier environment, each tuned to a kind of purpose that general-purpose RDBMSs are only loosely adapted to? I thought about that idea a long time ago when I was working with Heba B. over at i2 Technologies. It's an old story that I tell when I talk about the limits of multidimensional databases, so why don't I tell it now?
i2 was the leader in supply chain software, and I was assigned to work with those guys to enhance their product, Rhythm Reporter. We had a starter project and proved that we could build a multidimensional reporting model over the Rhythm data. What we couldn't do at the time was build a CORBA-compliant data broker on the back end for tighter integration. Well, we could have done it, but there was no funding for such a project. All things were go for a looser integration; however, we had one weird catch.
When you are costing out a bill of materials in a manufacturing process, there are sometimes unit costs that apply to the top of a hierarchy. This was a problem for Essbase. Think of a table: four legs and a tabletop, five pieces that make up the major components. All of those would be children of 'Table', and their materials, unit, and assembly costs could be aggregated. But sometimes there is a cost associated with 'Table' itself. How do you get that in without adding a bogus child? (I'll sketch the problem below.) I think this is a problem that MSAS solved later, but we didn't have a solution at the time, nor did we have the facility to write custom Java functions for the Calculation Engine. So the problem went unsolved, and it counted against us at the time.

It got me thinking about how best to coordinate multi-database solutions when a single technology couldn't hack it. i2's core product didn't use a relational database but something else entirely, and its data model had to be twisted through its query language to populate the RDBMS. Heba and I worked on getting that relational data into OLAP. So I hit on the idea of '3DB', a company specializing in object/relational/multidimensional data problems and solutions. But since I don't have an MBA, I couldn't figure out a way to convince management to do anything with it. I mean, a company that wouldn't partner with i2 to build a CORBA interface isn't going to try any actually risky ideas; besides, there were only about seven people on the planet who understood what we were doing in any detail.
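To make that bill-of-materials catch concrete, here's a toy rollup in Python. It's purely my own illustration (the class and the numbers are invented; this is not how Essbase consolidates anything). It just shows why a cost hanging off a parent member breaks a children-only aggregation.

```python
# A toy illustration of the parent-level cost problem; the class and
# numbers are invented for this sketch, not anything from Essbase or i2.

class BomNode:
    def __init__(self, name, own_cost=0.0, children=None):
        self.name = name
        self.own_cost = own_cost          # cost attached to this member itself
        self.children = children or []

    def rollup(self):
        # A children-only consolidation would drop own_cost on non-leaf
        # members; the fix is to add the member's own cost into its total.
        return self.own_cost + sum(c.rollup() for c in self.children)

table = BomNode("Table", own_cost=12.50, children=[  # assembly cost on 'Table'
    BomNode("Tabletop", own_cost=40.00),
    *[BomNode(f"Leg {i}", own_cost=7.25) for i in range(1, 5)],
])

print(table.rollup())  # 81.5 -- includes the 12.50 hanging on 'Table' itself
```

The workaround we would have needed was the bogus child: invent a leaf like 'Table Assembly' to carry the parent's own cost, so that everything aggregates cleanly from the leaves.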
I don't know what happened to CORBA, but the Java interface did get built. Still, the powers that be eventually decided that Essbase would be best suited, always and evermore, for a certain subset of financial apps. Stuff like supply chain was not in the offing; there was money to be made in ERP integration, and that's the way it went. However, geeks like myself lamented the lack of other analytic applications that might be served. I can't complain, really; it's been how many years since then, and I haven't made any money building much other than ERP integration apps.
But it only takes about seven people on the planet to think about such things theoretically and slowly roll out capable tech, and that's how I now think about the guys behind the Eigenbase project, which was started about five years ago. They get it. Different data stores for different kinds of data.
First stop: LucidDB. LucidDB is a column-based database. And wouldn't you know, they describe it so clearly that I get it the first time.
In LucidDB, database tables are vertically partitioned and stored in a highly compressed form. Vertical partitioning means that each page on disk stores values from only one column rather than entire rows; as a result, compression algorithms are much more effective because they can operate on homogeneous value domains, often with only a few distinct values. For example, a column storing the state component of a US address only has 50 possible values, so each value can be stored using only 6 bits instead of the 2-byte character strings used in a traditional uncompressed representation.
Vertical partitioning also means that a query that only accesses a subset of the columns of the referenced tables can avoid reading the other columns entirely. The net effect of vertical partitioning is greatly improved performance due to reduced disk I/O and more effective caching (data compression allows a greater logical dataset size to fit into a given amount of physical memory). Compression also allows disk storage to be used more effectively (e.g. for maintaining more indexes).
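To convince myself I really did get it, here's a toy sketch in Python of both ideas at once: a low-cardinality column dictionary-encoded into small integer codes, and a query that touches only the single column it needs. This is my own illustration, not LucidDB internals.

```python
# Toy column store: each column lives in its own list ("page"), so a query
# over one column never touches the others. My illustration, not LucidDB code.

orders = {
    "state":  ["TX", "CA", "TX", "NY", "TX", "CA"],
    "amount": [10.0, 12.5, 7.25, 40.0, 7.25, 7.25],
}

# Dictionary-encode the state column: few distinct values, small codes.
dictionary = sorted(set(orders["state"]))
code_of = {s: i for i, s in enumerate(dictionary)}
encoded_state = [code_of[s] for s in orders["state"]]
print(encoded_state)    # e.g. [2, 0, 2, 1, 2, 0]

bits_needed = max(1, (len(dictionary) - 1).bit_length())
print(bits_needed)      # 2 bits here; 6 bits would cover all 50 states

# SELECT SUM(amount): reads the amount "pages" only, never the state column.
print(sum(orders["amount"]))   # 84.25
```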
The companion to column store is bitmap indexing, which has well-known advantages for data warehousing. LucidDB's bitmap index implementation takes advantage of column store features; for example, bitmaps are built directly off of the compressed row representation, and are themselves stored compressed, reducing load time significantly. And at query time, they can be rapidly intersected to identify the exact portion of the table which contributes to query results. All access paths support asynchronous I/O with intelligent prefetch for optimal use of disk bandwidth.
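And the bitmap idea in miniature (again my own toy, not Eigenbase code): one bitmap per distinct value, so a conjunctive filter collapses to a bitwise AND that pinpoints exactly the qualifying rows.

```python
# A minimal sketch of bitmap index intersection, with made-up columns.

state  = ["CA", "TX", "CA", "NY", "TX", "CA"]
status = ["open", "open", "closed", "open", "closed", "open"]

def bitmap(column, value):
    # Pack one bit per row: bit i is set iff column[i] == value.
    bits = 0
    for i, v in enumerate(column):
        if v == value:
            bits |= 1 << i
    return bits

# WHERE state = 'CA' AND status = 'open' resolves to a single bitwise AND.
hits = bitmap(state, "CA") & bitmap(status, "open")
rows = [i for i in range(len(state)) if hits >> i & 1]
print(rows)   # [0, 5] -- the only rows that need to be fetched
```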
Nice. So I'm going to play with this puppy and then circle around to see if I can learn some more.
Now I've been pointed, by a thoughtful reader, to a project called Firewater. If I understand it correctly, it means that I can spread what's partitionable in LucidDB over multiple nodes. The implication is that I could maintain a model on a cluster of blades and... well, I'd have to do some thinking about that. But here are my first guesses. My first guess is that the licensing model for Essbase has made multi-server partitioning impractical. My second guess is that I would put my measures in columns and could initiate parallelism in my aggregations...
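Here's roughly what I'm imagining, sketched in Python with made-up data; I have no idea yet whether this is how Firewater actually distributes work. Each node aggregates its own partition of a measure column, and the partial results combine at the end, shared-nothing style.

```python
# Speculative sketch of a shared-nothing rollup over partitioned measure
# columns; the data and the partitioning scheme are invented for illustration.

from concurrent.futures import ProcessPoolExecutor

partitions = [
    [10.0, 12.5, 7.25],    # measure column, partition on node 1
    [40.0, 7.25],          # partition on node 2
    [7.25, 7.25],          # partition on node 3
]

def partial_sum(column):
    # Each "node" aggregates its own partition independently.
    return sum(column)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        total = sum(pool.map(partial_sum, partitions))
    print(total)   # 91.5 -- same answer as a single-node scan
```

If the partitions really are that independent, the scan parallelizes with the node count, and the open question becomes whether the combine step or cross-node communication turns into the bottleneck, which is exactly what I'm wondering about Essbase below.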
Does anybody know if transparent partitions in Essbase are shared-nothing, or are/were there communication bottlenecks that messed up performance? What were the practical limits on Essbase partitions? I'm sure I never did more than six in a single model. That was CCE back in Atlanta, wasn't it?