What is big data?
I want to eliminate some confusion by giving a history of big data from my perspective. I don't know how long this will take, but I'd guess about 5000 words. Let's get started.
Big Data is clumsy. Because it is big, the habits you have with non-big data become impractical. Data becomes 'big' when it forces you, by its size, to adopt new processes, hardware or software to deal with it in a reasonable period of time.
Part One - Personal Data Growing Pains
I started thinking about big data a long time ago. I'd say it was somewhere around 2002. At the time, I used a laptop with about a 5GB drive. This was about as much as I would ever need to run my 32-bit OS (Win 2K Pro, which soon became Win XP Pro). But I had collected about 4000 names in my Palm Pilot contacts. As I had to move and sync this contact information with Outlook, the system started to fail. It got slower and slower. The same thing was happening with my email, which I had all in Eudora. Migrating this all-important data from laptop to laptop began to be a serious problem. One day something went wrong. Somehow my address sync created duplicate records. Now I had 8000 contacts, Outlook slowed to a crawl, and there was no way for me to deduplicate the records except by writing a custom program (I didn't know how) or doing it manually. It took months.
This simple mistake put me in a class of user that the builders of Outlook didn't consider mainstream. Even years later, when I uploaded my contacts to Google's new mail service, GMail, it was very slow to look up a contact. I could have paid $200 to purchase the popular contact manager Act!, but I didn't want to. The point was that my data was too big for Outlook - the way I had managed contacts all along.
So the first notion I want to communicate is that no data is really too big. Somewhere there is a system that can handle your data no matter how big you think it is. The question is whether or not you must endure a paradigm shift to accommodate having outgrown your old system. Let us recall archive.org. The Internet Archive has been eating and saving the entire static WWW for a decade now.
What really began to give me problems was my music collection. At some point I began using iTunes to rip my 600+ compact discs. This filled up my laptop in no time. I started buying external hard drives. At first I got an expensive LaCie Porsche Design external USB drive. You may remember them from that Will Smith movie I Am Legend - in his laboratory basement... OK, that's just me. Every six months my music, movies and pictures would outgrow my capacity and I would buy a new drive. Fortunately drive capacities were going up. 100, 200, 250GB. It was hard to imagine I would fill these up, but I did. Soon I was dragging around the Seagate drives that eventually became the FreeAgent Pro series. A blue one for my business information and a red one for my music. Capacity at last, but capricious.
Like clockwork, every year for three years straight one of my major hard drives would fail. By 2007 I had a terabyte at home spinning in various disks attached to my Dell tower. By then I had become familiar with two vital tools. The first was a disk recovery system called simply Recover My Files, and the second was a disk management program called File Tree Pro, which had a deduplication facility as well as a heatmap visualizer.
Disaster recovery became an issue. When your collection of data becomes larger, it becomes more precious because you realize you've been collecting it for years and you'll do anything to retain it. Over this period of failing drives, I realized nothing was foolproof, so I had to make friends with redundancy. I kept a copy of all my iTunes music on my Dell tower and also on an external hard drive. And because of the way Recover My Files worked, I had up to half of my files named Recovered 039203.mp3 or some such. So some of my files were redundant and I knew it; some were redundant and I didn't. Now I had a class of files I couldn't identify but dared not throw away. I knew that the next time a drive failed I would want a duplicate, but in the meantime all these unidentified files sat in iTunes and slowed it down. So literally for years I would see a song 'Recovered 034930.mp3' and I would have to listen to it, identify it and change the tags. Metadata management hell.
Recovery meant I ran out to Fry's on a Thursday, bought a new drive, plugged it in and dedicated 100% of that machine's processing to scouring the failed drive. Maybe by Sunday I would have gotten through 200GB. Recovery is non-trivial. Deduplication is both necessary and risky.
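To make the deduplication problem concrete, here is a minimal sketch of finding files with identical contents, such as a song and its 'Recovered' twin. It is an illustration only, not how File Tree Pro works internally: the function name find_duplicate_files is made up for this example, and comparing files by SHA-256 hash is my assumption. It only reports duplicate groups and deletes nothing, since deleting is the risky part.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(root, chunk_size=1 << 20):
    """Return groups of paths under `root` whose contents are identical.

    Hypothetical helper for illustration: files are grouped by size first,
    then by SHA-256 of their contents. Nothing is deleted.
    """
    # Pass 1: group by size, which is cheap and rules out most files.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with another file.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in chunks so large media files do not fill memory.
                for chunk in iter(lambda: handle.read(chunk_size), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)

    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Each returned group is a list of paths holding the same bytes, so a file named Recovered 039203.mp3 shows up next to the properly named copy it duplicates. Note that this only catches byte-for-byte copies; two rips of the same song with different tags would not match.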
In the normal course of dealing with non-big data, having multiple versions makes sense. Deleting and restoring doesn't take much thought. Backup is simple, disk is cheap. With big data, deleting may not make sense and restoring takes time. Yes, disk is cheap, but now it seems very slow.
I've had websites for a long time. I've also had FTP. But I never considered FTPing anything but the most important files up to my website for safekeeping. Like most folks I used Google Docs. Then I got Flickr and Picasa to hold my pictures. It was a long time before a web-based music storage service became available, but I did start using Carbonite. With all of these services, if I wasn't creating content at that web location, I always had to deal with copying data from my machine over the web. Painstaking and slow. The first real breakthrough came with Dropbox, and then with Backblaze. These two services (finally) let me keep my data locally and worked in the background, without intervention from me, to make copies of my data. Now I had smart duplication, the kind of redundancy that didn't have the failings of my own recovery and backups. Files were duplicated in a separate place, I could identify all of them, and I didn't have to spend 100% of my processing time and attention making it so.
When you are dealing with big data, you need to have this kind of backend process working in your favor. Hardware will eventually fail, networks are too slow, metadata management is an awful thing to do manually.
--
Today I have the following personal setup.
I run everything from my MacBook Pro, to which I have attached:
OneBox - 1TB laptop hard drive
Bluebox - 320GB of business related files
Silverbox - 3TB of personal, non-media files
Whitebox - 2TB of media files
BBox - 1TB of backup of my main 1TB hard-drive
ZBox - encrypted information
Backblaze - backup of all the above except for BBox.
Dropbox - 10GB of documents including ZBox (redundant to OneBox)
BlueKey - External USB key including portions of Dropbox & ZBox
iTunes Match - 25,000 duplicated music files
PicasaWeb - some ungodly amount of duplicated pictures
Flickr - some unique fraction of same pictures + more
AmazonMusic Player - about 5000 MP3s
Audible - all of my audiobooks
Amazon - all of my eBooks
OReilly - the rest of my nonAmazon eBooks
Evernote - some 5000 documents
iCloud - Contacts, notes, calendars (also on iPad, iPhone)
So I have all of my data locally, most of it 1X redundant locally, the more important 2X locally redundant. I have all of my data remotely redundant, with some of my music and some of my pictures 2X remotely redundant. About 15 times a week I manually initiate BBox backups, which are incremental (Apple Time Machine).
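The bookkeeping behind '1X' and '2X' redundant can be sketched in a few lines. This is only an illustration: the function name redundancy_report and the shape of the inventory are made up for the example, and the sample inventory is a simplified guess at part of the setup above, not an exact record of it. I count redundancy as extra copies beyond the first local one, and every remote copy counts as redundant.

```python
def redundancy_report(copies):
    """Count extra local copies and remote copies for each dataset.

    `copies` maps a dataset name to a list of (kind, location) pairs,
    where kind is "local" or "remote". Hypothetical structure for illustration.
    """
    report = {}
    for dataset, locations in copies.items():
        local = sum(1 for kind, _ in locations if kind == "local")
        remote = sum(1 for kind, _ in locations if kind == "remote")
        report[dataset] = {
            # The first local copy is the original, so it is not redundancy.
            "local_redundancy": max(local - 1, 0),
            "remote_redundancy": remote,
        }
    return report


# Simplified, assumed inventory; not an exact record of the drives above.
inventory = {
    "main drive": [("local", "OneBox"), ("local", "BBox"), ("remote", "Backblaze")],
    "media": [("local", "Whitebox"), ("remote", "Backblaze")],
}

for dataset, counts in redundancy_report(inventory).items():
    print(dataset, counts)
```

Anything that comes back with zero for both counts is the data a single drive failure would take with it.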
My most important pictures in the world are on my keychain.
Additionally, I burn various USB sticks, DVD-ROMs and SD cards with the more important fractions of important files and stash them around the house, in the garage, and physically offsite. Plus I have some stuff in Amazon S3, but I forget what.
What I don't have redundantly is all my old software, which lies in Case Logic CD cases on my bookshelf. It turns out that almost nothing is as useless as old software. Even Eudora.
I have come to trust the cloud for backup, but there isn't much that I have there that I do not keep locally, which basically means my blogs and what I've said on FB, Twitter and the dozens of webchat forums I have attended over the years.
What I have learned is that at some point, you will outgrow your current systems, even if you didn't plan to. When you do there will be barriers you cross (for me it was 100GB collections) that force you to deal with new systems, new problems, new costs, but mostly new habits & new processes.
Then guess what. It's just data again. Yeah, I have about 6-7TB of personal data, not counting redundant backups. I spend maybe 2 minutes a day making sure it gets backed up, and almost no time worrying about it.