We already have the concept of the Producer and the Consumer. Now we add the concept of the Curator.
The purpose of this is to regularize the idea of dynamic resizing of the query space. The Curator would thus be responsible for knowing which data are permanent and can no longer be modified. Let's take the example of a financial data warehouse. Generally, we allow modification and restatement of data past the close of a fiscal period. At some point, however, there needs to be a cutoff. This may be one period, two periods, or perhaps 6 or 12. The longer the window, the more transactions we would likely process in any refresh. For the sake of our example, let us set that cutoff at three months, with a grain of one month.
So every day we get a restatement of the current month. This is to be expected. But we may also get data that include transactions up to three months old. On June 1, however, we would close off the trailing month of our three-month window and accept no further updates to March. At this point, the Curator would engage the spinning database and push out the final version of March. This should be done in the manner most efficient for a database load. So the Curator has a kind of producer itself. This final version of March will go into its own S3 bucket and be registered in a Curator table. Obviously this is a job that can vary in size with the desired grain of curation. I would recommend a month or a quarter. Let's call this operation Curator::SetFinal(2014-03). It makes a permanent copy of the last official update for that month of data and stores it in the Curator's store.
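To make the cutoff concrete, here is a minimal sketch in Python of how a SetFinal step might look. Everything in it is an assumption for illustration: the `month_to_close` helper, the `Curator` class, and the in-memory `store` and `registry` that stand in for the S3 bucket and the Curator table. It is not tied to any real storage API.

```python
from dataclasses import dataclass, field
from datetime import date


def month_to_close(today: date, window_months: int = 3) -> str:
    """Return the month (YYYY-MM) that drops out of the restatement window."""
    # Count months as a single index so year boundaries need no special case.
    index = today.year * 12 + (today.month - 1) - window_months
    return f"{index // 12:04d}-{index % 12 + 1:02d}"


@dataclass
class Curator:
    # Hypothetical stand-in for the Curator's S3 store: month -> frozen rows.
    store: dict = field(default_factory=dict)
    # Hypothetical stand-in for the Curator table that registers each final.
    registry: list = field(default_factory=list)

    def set_final(self, month: str, rows: list) -> None:
        """Freeze the last official version of a month; refuse to redo it."""
        if month in self.store:
            raise ValueError(f"{month} is already final")
        self.store[month] = tuple(rows)  # immutable copy of the rows
        self.registry.append(month)

    def accepts_updates(self, month: str) -> bool:
        """A month accepts restatements only until it has been finalized."""
        return month not in self.store


curator = Curator()
closing = month_to_close(date(2014, 6, 1))  # the month that closes on June 1
curator.set_final(closing, [{"account": "A", "amount": 100}])
```

With a three-month window, running this on June 1, 2014 finalizes 2014-03, after which `accepts_updates("2014-03")` is false while April and later months stay open to restatement.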
Now, depending on how we want to handle it, we can either spin up a database of the new size in parallel, or drop and rebuild the production database overnight at the new size. Let's say we want to expand the history window from 3 years to 4 years. We could call that Curator::Exhibit(4 years), and that would pull the appropriate data from the Curator's store. The bottom line is that the Curator can dynamically resize the query space by using data certified by the application.
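The Exhibit step can be sketched the same way. This is only an illustration under stated assumptions: the `exhibit` function, its parameters, and the dict keyed by YYYY-MM that stands in for the Curator's store are all hypothetical, and a real version would load the selected months into the new or rebuilt database rather than return them.

```python
def exhibit(store: dict, latest_month: str, years: int) -> dict:
    """Select the finalized months inside a trailing window of `years` years."""
    year, month = (int(part) for part in latest_month.split("-"))
    last = year * 12 + (month - 1)   # month index of the newest final
    first = last - years * 12 + 1    # month index of the oldest month to show
    selected = {}
    for key in sorted(store):
        y, m = (int(part) for part in key.split("-"))
        if first <= y * 12 + (m - 1) <= last:
            selected[key] = store[key]
    return selected


# Hypothetical store holding finals from 2010-01 through 2014-03.
store = {
    f"{y:04d}-{m:02d}": []
    for y in range(2010, 2015)
    for m in range(1, 13)
    if (y, m) <= (2014, 3)
}

three_years = exhibit(store, "2014-03", 3)
four_years = exhibit(store, "2014-03", 4)
```

Because every month in the store is already final, widening the window is just a matter of selecting more keys; nothing has to be recomputed from the source systems.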
Not bad, eh?