One of the great things about columnar databases is that you really don't have to worry about the cost of doing incremental updates to columns. In this post I'm going to talk about one of the ways we take advantage of that, and specifically about what I call 'errorfication'.
Errorfication is a way of thinking about the gaps in your analytic data. Imagine that you have - oh, let's say six or seven fact and dimension tables that you are combining in new ways that have never been thought of before. It may sound surprising, but one day you should go and look at a big Teradata installation. I have. Chances are, Teradata being what it is, that there will be hundreds of tables that only exist there, and nobody has really had time to experiment with the various kinds of multidimensional analysis that could be done. Teradata developers and hardware and all of that are rather expensive, so you keep everything (once you've paid for a node, why not?) and you build what you must.

Well, we came into a situation like that, pulled off all the history for a half dozen tables or so, and dropped it into S3 with a Producer. In our case, we chunked the data into one-month segments, which gave us all we needed. Some of the tables gave us good-sized chunks, maybe half a million records a month; others just a few thousand. We quickly mocked up a black hole. What's a black hole? Well, that's a collapsed star, you silly - a denormalized table with all the columns and hierarchies and attributes we need. Everything gets dumped into the denormalized table for reporting purposes, which makes it easy for a client like Jaspersoft to use it as a domain. BUT.
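Before we get to that BUT, here is roughly what the black hole looks like as a table. The names and types here are hypothetical, and I'm assuming it's the same final_report that gets loaded further down: a handful of facts plus every dimension attribute and label, flattened into one wide table.

-- rough sketch of the denormalized reporting table (the black hole); names and types are hypothetical
create table final_report (
  field1 numeric,
  fact2 numeric,
  fact3 numeric,
  attribute1 varchar(64),
  attribute2 varchar(64),
  label1 varchar(64),
  label2 varchar(64)
);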
There are foreign key inconsistencies. So when I made the black hole, I started going through a now-familiar process of iteration. Take the facts from your clean fact table and then group them together in your first select. What we want to do is something like this:
insert into final_report
select a.field1,
a.fact2,
a.fact3,
b.attribute1,
b.attribute2,
c.label1,
c.label2
from main_fact a
join dimension_one b on (a.key = b.key)
left outer join dimension_two c on (a.key2 = c.key)
You can see that I did just fine with my counts joining table A to table B, but table C fell short, so I performed an outer join. But that means c.label1 and c.label2 are going to be left NULL where there is no match on a.key2. Are you with me so far? Since I know that possibility exists for every join, I may as well set up a series of errorfication fields that tell me which joins came up empty. That way I can left outer join all the time and not worry about my counts any more. So let's think about this in Star Trek terms. You know the scene: Spock checks the sensors and he's able to check 4 out of 5 of the facts about the alien planet, but some weird emission seems to be blocking the Enterprise's ability to get the fifth fact. With the left outer joins, you get all the facts you can get, and those that are missing can be qualified with the type of error you know about ahead of time. At query time you always have a full database for reporting, and you may choose to include or exclude the partial data based on the error columns.

Now generally speaking this is ugly business in a relational db, because if your 6-way join was painful in creation, imagine the time it takes to go back over it again like so:
update final_report
set missing_table_c_records = true
where label1 is null or label2 is null
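Of course, that update only works if the errorfication fields exist in the first place. A minimal sketch, assuming one boolean flag per joined dimension table (these column names are my own):

-- one errorfication flag per joined dimension, defaulting to 'no error'
alter table final_report add column missing_table_b_records boolean default false;
alter table final_report add column missing_table_c_records boolean default false;

You could just as easily bake the flags into the table definition and set them with a case expression in the insert itself; either way, each flag records which join came up empty for that row.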
Obviously the more tables you join, the more potential missing data you have. But it's better to know and present that as part of your entire analysis. Every time you update your final_report table, you have an idea of your data quality.
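That idea of your data quality can be as simple as a count on the flag, and the same flag lets report queries include or exclude the partial rows. Something along these lines, again against the hypothetical flag column above:

-- how much of the report is missing table C attributes?
select missing_table_c_records, count(*)
from final_report
group by missing_table_c_records;

-- exclude the partial rows when a report demands complete data
select attribute1, sum(fact2)
from final_report
where missing_table_c_records = false
group by attribute1;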
Enter columnar databases. In Vertica the cost of updating a single column is very low for two reasons. One, it doesn't store the single value you change all across the database; it does smart run length encoding. Two, it only deals with the column you select, not with 'records' as wide as your table. The same is true for Redshift, although you should tell Redshift to use run length encoding. Either way, the columnar technologies make it very inexpensive to do field updates. And that is how we can produce full datasets that have the errors built in for missing data.
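On Redshift, that encoding choice is something you spell out per column, and a low-cardinality flag like these is a natural fit for run length encoding. A sketch of the same hypothetical flag column, Redshift-style:

-- Redshift flavor: declare run length encoding on the low-cardinality flag column
alter table final_report add column missing_table_c_records boolean default false encode runlength;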
The advantage is that your end users will be able to make more sensible demands for data quality improvement. "Hey, we need to get more data to populate table C" is the kind of request that immediately translates into something your DBAs will understand. Furthermore, users will have a better picture of which parts of the entire data picture processed through the back end are more readily available, and can make appropriate judgement calls - better informed than they would be if the entire dataset were abbreviated to the net of the inner joins, or the blanks were left with no explanation.