I keep worrying about whether to be formal or casual when I talk about what I'm doing professionally. So I have a whole lot of half-finished outlines describing the shift in my work over the past three years. But I suppose I can speak with enough authority to be casual now as I continue to talk about big data - and yes, my biases will creep in.
Let's identify big data.
What makes big data big? The fact that you can no longer take operating with it for granted. It's that simple. There are several dimensions along which your data outgrows the britches you bought for it. And if you've heard this before, please forgive me, but consensus is built on independent validation of what others are experiencing. So my aim is to find the pain as I diagnose what kind of data headaches you are dealing with.
In general, I find that people are much more specific and consistent when they talk about pain than when they talk about what they hope for or what they think other people might be thinking - so let's take the headache approach to defining specific types of big data.
1. Velocity
Data gets out of hand when you have to process it faster. Think of the processing windows in your SLA. What counts as real time, or near real time? How much very recent data do you need to see? Is there a use case for analyzing events from the past two hours? The past 30 minutes? The past 5 minutes? Almost any database can process a trickle of data in near real time; the question is how much you need processed every minute. When this gets to be a headache, you are dealing with big data.
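A back-of-the-envelope way to feel this out is to count how many events actually land inside the window you think you need. Here is a minimal Python sketch (the WINDOW size and the observe function are illustrative names, not any particular product's API, and it assumes events arrive in timestamp order):

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # try 120, 30, 5 to feel the difference

events = deque()  # timestamps of events seen so far, in arrival order

def observe(ts: datetime) -> int:
    """Record one event and return how many events sit in the current window."""
    events.append(ts)
    cutoff = ts - WINDOW
    while events and events[0] < cutoff:
        events.popleft()  # expire events that fell out of the window
    return len(events)

# e.g. call observe(datetime.now()) on each arriving event;
# when this count outruns what you can process per minute, velocity hurts
```

When the number that comes back exceeds what your pipeline can chew through before the next window closes, you have your first headache.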
2. Veracity / Validity
When you have to process a billion records, how many of them are perfect? If one field is bogus, you have to identify that cell within the mass of data you are processing, record by record. When it comes to database ingestion, chances are the database gives you no standard function to segregate out a million errors and re-run them separately. So you have to engineer a process that does it for you. When this gets to be a headache, you are dealing with big data.
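The shape of that engineered process is usually a split: route clean records through, quarantine the suspects for a separate re-run. A minimal Python sketch of the idea (the validation rule, the 'amount' field, and the file names are stand-ins for your own, not a prescription):

```python
import csv

def is_valid(row: dict) -> bool:
    """Example rule: every field non-empty and 'amount' parseable as a number."""
    if any((v or "").strip() == "" for v in row.values()):
        return False
    try:
        float(row["amount"])
    except (KeyError, ValueError):
        return False
    return True

# Split one pass over the source into clean rows and a quarantine file
# that can be repaired and re-run on its own schedule.
with open("records.csv", newline="") as src, \
     open("clean.csv", "w", newline="") as good, \
     open("quarantine.csv", "w", newline="") as bad:
    reader = csv.DictReader(src)
    good_w = csv.DictWriter(good, fieldnames=reader.fieldnames)
    bad_w = csv.DictWriter(bad, fieldnames=reader.fieldnames)
    good_w.writeheader()
    bad_w.writeheader()
    for row in reader:
        (good_w if is_valid(row) else bad_w).writerow(row)
```

The point isn't the twenty lines; it's that at a billion records you own this machinery, its monitoring, and its re-run logic, because the database won't do it for you.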
3. Variety
In your current system, there is some data that makes you cringe because the current business rules force six passes over the table to categorize it properly. Fortunately there are only 500,000 of those records. If that section of the data gets big, you have a problem. Or how about this: some of your data requires the full 255 characters and has no short descriptions. Or some of your data carries two short keys and 275 floating-point values. Big data means these once-rare 'outliers' may now show up in significant volumes. If your weird data is getting larger, you are dealing with big data.
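One way to know whether your weird data is growing is to profile the distinct shapes your records arrive in. A toy Python sketch (the record layouts here are invented to echo the examples above):

```python
from collections import Counter

def shape(record: dict) -> tuple:
    """A record's shape: its sorted field names plus rough value types."""
    return tuple(sorted((k, type(v).__name__) for k, v in record.items()))

records = [
    {"id": "A1", "desc": "x" * 255},                               # long-description variant
    {"k1": "B", "k2": "C", **{f"f{i}": 0.0 for i in range(275)}},  # two keys, 275 floats
    {"id": "A2", "desc": "short"},
]

# Count how many records share each shape; watch the tail over time.
counts = Counter(shape(r) for r in records)
for s, n in counts.most_common():
    print(n, "records with", len(s), "fields")
```

Run something like this weekly and the 'outlier' shapes either stay noise or announce themselves as a variety problem.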
4. Virtual Hardware
Obviously, some numbers just sound big. Since I am learning to think *really* big when I consider the capacities of Amazon Redshift, things change. "More than I ever had to deal with" is actually quite big enough. But let's talk machine size for a minute. Once upon a time, The Gap was tapping their feet waiting for an OLAP engine that could handle a million-member dimension - for their marketing. They made available Sun's largest single machine, an E9000 or some such, and re-niced our processes so we could have more of the box than any single program before. Once upon another time, HP made a 36-processor Superdome available to my team as we processed data for a design at Boeing. In those days, that was huge. Today both the Sun and the HP would fit inside the processing space of a five-node 8XL Redshift cluster with plenty of room to spare. If you cannot conceive of compute power beyond single boxes, no matter how large those boxes may be, your data may soon become a headache.
Let me expand on this point because it is the point on which some rather serious economics hinge, especially the economics of your job in the enterprise.
In business computing, there will always be unpredictable market conditions, evolving product performance data, changing skills and knowledge within management and the rank and file, and wildly varying customer information, not to mention regulatory compliance reports. That means keeping your business on top of the data being generated in your industry will always be beyond your ability - even if you have all the resources of Fort Meade. So what you have been allowed to build has always been constrained by budget, time, and resources.
The Big Data revolution is all about putting better tools and processes in place that allow you to do more with less. If new products and technology don't expand your capacity, then you are wasting time, money, and effort. So the opportunity should be considered in terms of what more you can do with these new advances.
If your organization's aim is to cut costs while keeping the same functionality, your job is at risk anyway. It's not the cloud that threatens your position; it is your management's view of the value of your work to the company. So if you feel threatened by the existence of cloud computing, start considering the unpredictable market conditions, evolving product performance data, changing skillsets and regulations as they relate to your unique understanding of what your company is experiencing. If you can't come up with ideas for improvement your management will buy into, that says more about your company than it does about the cloud. The cloud is happening. It presents new economies of scale that your company did not have to pay to research and build. How your company takes advantage depends on its culture.
Let's go back to basics: why do we build systems, and how do we judge their effectiveness? I have golden rules for end-user computing. We'll talk about those in the next part, where we can identify where change is happening, figure out whether we need big data technologies to solve those big headaches, and then demonstrate the value you can bring to your business.