Here's the highlight reel. When Peter Drucker invented the MBA, part of the idea of professional business management was that ordinary individuals could be trained to run a business. Not a generalized business, but a specialized business in some category (NAICS codes). The way Harvard taught case studies, such a cadre would be measured within the scope of business and product analysis, and their public company would be made legible to public shareholders through P&L and balance sheet metrics. I.e., IBM competes in its NAICS sector, becomes a market leader by taking product share away from competitors, and then you know to buy the IBM ticker. All the analysis is around product competition and market share, and the effectiveness of management has everything to do with its ability to execute in that narrow area.
This doesn't work for general contractors. This doesn't work for hospitals. This doesn't work for software. This doesn't work for conglomerates. Shareholder maximization only works for certain kinds of single-focus businesses. So long as the balance sheet works, in this narrow way of looking at business performance, you can justify all kinds of trickery that is neither good management nor good governance.
So there are only a few management teams, relatively speaking, who know how to make a company profitable on narrow profit margins. A conglomerate can have a balanced portfolio of businesses that won't suffer the business cycle. That's how Berkshire Hathaway makes money: by running multiple businesses in different NAICS sectors and not making the whole company dependent on a killer app. Each of these businesses, after a time, contributes to a large pool of funds that allows the directors to make experimental buys in new areas, instead of mergers in the same area. Bezos has said specifically that Amazon can afford to make billion-dollar mistakes, because the underlying businesses are already fully capitalized and don't need to keep growing or be leveraged before they are mature.
People have wanted AWS to be spun off from Amazon for years, and Wall Street has been hedging on the stock price because, before AWS became a market leader, Bezos never broke out its P&L and balance sheet from the whole of the company. Wall Street for the most part doesn't know how to do company management analysis either. That's why index funds are winning. Anyway, conglomerates are safer from hostile takeovers because the executives actually understand how to run their businesses and there is no incentive to sell out the business. That's why a class of Silicon Valley VCs and entrepreneurs loathe the idea of working for Amazon. There's no exit strategy. You don't exit conglomerates. They scale horizontally.
It's interesting how well Bezos has run it as a conglomerate, given the bad news recently about GE. I don't think, outside of P&G, that most Americans understand conglomerates. So I'm hoping that Amazon does a stock split and gets listed on the Dow.
Yes, it’s possible. It’s coming from a side angle though. Basically Congress is going to have to admit at some point that the credit card companies keep better track of individuals. And then the credit card companies will have to admit that the new crypto devs have a more secure way of keeping individuals in control of their own identity. And then somebody at Apple is going to make it easy to integrate anonymous identity with iCloud and ApplePay.
Basically, the interest around blockchain and cryptocurrency is going to drive the development of secure identity management and peer to peer payment tech.
If you look at something like Keybase you can get a feel for how a single ID can federate other well known IDs.
I forget the name of this other company, but they have the ability to take your one credit card and authorize it for all your online purchases, except you use a different 'credit card number' for each transaction. So your single, real credit card ID is never used online. It's just used to authorize, from your desktop (or thumbprint, as Apple has proven), an on-the-spot value card. This technology is similar to AWS KMS and HashiCorp's Vault. I.e., it's already done and well understood.
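For a feel of the mechanics, here is a toy sketch of that one-time card number idea. Everything in it (the vault class, the merchant name, the number format) is invented for illustration; a real issuer would do this inside a hardened payment service, not a Python dict.

```python
import secrets

class CardVault:
    """Toy model: the real card number stays in the vault; each purchase
    is authorized with a freshly minted single-use surrogate number."""

    def __init__(self, real_card_number: str):
        self._real = real_card_number   # never leaves the vault
        self._issued = {}               # surrogate number -> merchant

    def issue_virtual_number(self, merchant: str) -> str:
        """Mint a single-use surrogate number for one transaction."""
        surrogate = "4" + "".join(str(secrets.randbelow(10)) for _ in range(15))
        self._issued[surrogate] = merchant
        return surrogate

    def authorize(self, surrogate: str, merchant: str) -> bool:
        """Honor the surrogate exactly once, only for its merchant."""
        return self._issued.pop(surrogate, None) == merchant

vault = CardVault("4111111111111111")
num = vault.issue_virtual_number("bookstore.example")
print(vault.authorize(num, "bookstore.example"))  # True: first use succeeds
print(vault.authorize(num, "bookstore.example"))  # False: replay is rejected
```

The point of the pattern: the real identifier authorizes, but is never itself exposed to the merchant.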
This combination of technologies already exists. The question is basically who is going to mainstream it best. I bet Apple will. Now let's say they do that next year or late this year. How many Congressmen do you have to buy in order for them to:
1. Authorize its use for Government things.
2. Get bureaucrats across the US to understand it.
Yeah, almost never. Which means somebody needs to hack the other two credit bureaus to prove what we already know: the identification schemes that dominate US commerce are insecure and outdated. Then of course you'll have to deal with millions of people paranoid about the possibility of a National ID, seeing as it's 'racist' to have people show a driver's license when they vote.
Bottom line. Stop hoping for the masses. Do the math, figure it out, and do it with people who already get it now. And yes I did triple my money in Bitcoin (and in Amazon stock before that (and in Google before that (and in Netscape before that (and in Inktomi before that)))) and no I’m not hurt by the price collapse today.
On the other hand, there will be people who will slowly duplicate old tech with new tech, and maybe they'll be able to sell this stuff. The old dispirited workers at the NSA and CIA are going to have to work somewhere.
Once upon a time there was a mainframe computer. I ignored it. So did you. We were wrong.
My mighty mainframe was a CDC Cyber running NOS. It was such a craptastic machine that I hated doing homework, precisely because the damned thing wasn't reliable enough for me to slack off until I was ready. Surely, whenever I got into a groove of programming in fricking Ada, it would be broken. I hated my college's datacenter and the frogs who worked there. It was enough to make me love microcode and BNF. But that was 30 years ago, literally. Nevertheless, it left a bad taste in my mouth, particularly because I loved DEC Vaxen and client-server. After all, I did work at Xerox.
But while I was at the big X, I did come to love and appreciate VM/CMS. What a cool idea. And there was always a weird kind of appeal to ISPF, I have to say. Plus it was also very cool to use a channel-attached Overland Data tape drive that worked with IND$FILE. OK, enough reminiscing. We all know that IBM was an early embracer of Linux, and it became cool again when I read about something called a Beowulf cluster of Linux nodes running on a System/390. And of course the RS/6000. Yada. Yeah. They understand hardware. I even listened up to the time when they started on about The Grid and grid computing in general. Why? Because I have always been a big data fanatic. OK, devotee.
So this morning I'm talking to my engineering manager colleague down in LATAM, and he's helping me understand what the sharp elements of his team have been doing over the past few years. I know a bit about it, but now I'm responsible for communicating it. Non-trivial. It turns out that they've essentially been packing puffy clouds around mainframes. Huh? What?
We have come up with a way to put legacy mainframe data systems into cloud native architectures.
One of the things we do is work with airlines. I don't have to tell you again. You know. And a whole lot of airline reservation information, flight routing, frequent flyer currency, etc. lives on a multiplicity of divergent systems and architectures. We don't have patience with all of that. We're a systems integrator doing what needs doing in the AWS cloud. Of course, a lot of this data is so deeply embedded in complex systems that it will never make sense to re-engineer it. So we said: leave it where it stands. We built stuff around it. Our offerings are called API Modernization and Middleware Modernization Blueprint, and what we've been able to do is engineer and rationalize what lives best where in an extended hybrid cloud environment. One of the things we've done is apply Scala Akka frameworks to create RESTful interfaces, and GraphQL, to smooth out the rough edges of ancient mainframe subsystems and scale them up to deal with the new real world of cloud and webscale applications. Another thing we've done is take complex business rule logic that was once a legacy monolith and generalize and retool it to work in multiple applications. We've augmented or replaced IBM MQ message queues and worked them into both Kafka and Kinesis. We prefer the headache-free Kinesis; we don't have patience for zookeeping. Naturally, we've used the power of CloudWatch to provide more reliable and customizable system monitoring, and of course we've used AWS Availability Zones to make the whole thing robust against failure. So yeah, we can think of your mainframes as a middleware component in our ever-evolving data management architecture.
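To make the wrapping pattern concrete, here is a minimal sketch in Python (not the Scala/Akka stack the team actually uses) of putting a JSON face on a legacy fixed-width record. The record layout and field names are invented for illustration; the point is that the mainframe format stays untouched while the interface modernizes.

```python
import json

# Invented fixed-width layout: (field name, start offset, end offset).
LEGACY_LAYOUT = [("flight", 0, 6), ("origin", 6, 9), ("dest", 9, 12), ("seats", 12, 15)]

def parse_legacy_record(record: str) -> dict:
    """Translate one fixed-width mainframe record into a plain dict."""
    row = {name: record[start:end].strip() for name, start, end in LEGACY_LAYOUT}
    row["seats"] = int(row["seats"])  # legacy numerics arrive as zero-padded text
    return row

def rest_facade(record: str) -> str:
    """What a RESTful endpoint would hand back for this record."""
    return json.dumps(parse_legacy_record(record))

print(rest_facade("UA0123LAXJFK042"))
# {"flight": "UA0123", "origin": "LAX", "dest": "JFK", "seats": 42}
```

A real facade adds transport, auth, and scaling on top, but the translation layer is the heart of the approach: leave the data where it stands, build around it.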
Obviously this is not cutting and pasting technology. It requires deep thought, patience and understanding. We've got our share of that and it's working out nicely. It's not easy to communicate all of those details, but I wanted to give you a heads up so you could consider some of the interesting directions our quest for data perfection has taken us. It has taken us back to the legacy of centralized computing, and we have recast that command and control to fit into contemporary cloud architecture. What a journey. Hey mainframe, we're pals again.
This won’t take long. I’m still in the mood to rave. I’ve fallen in love with my new secret keeper: Biscuit + KMS. Biscuit is a multi-region HA key-value store for your AWS infrastructure secrets.
So I'm really a big fan of HashiCorp, and so are the rest of us at Full360. I've been using Packer and Vagrant for a couple years now, and I just became dangerous with Consul last fall. Now I figured it's time to learn Terraform and especially Vault. Except I don't have as much time as I used to. Still, I'm relatively paranoid about security, and I don't like hiding and unhiding volumes to grab pem files and whatnot. My parameterized setaws.sh, which exports AWS access keys and secret keys into environment variables for every customer I work with, is getting rather cumbersome. So yes, I should use something like Vault. But. Vault is cool and complicated, and I don't want to dedicate a little fleet of my machines to supporting it. I'm not going to be granting temporary access to IAM or other roles (though Biscuit does grants); this is just all about me maintaining some passwords and stuff for a dozen VPCs or so. I want to keep it simple.
So it turns out that Biscuit is just what I need, so far. It basically took me about 2 minutes to make my GOPATH make sense of the Go stuff I did and forgot about last year; then I followed the simple instructions on Biscuit's GitHub. It took 2 minutes and 33 seconds to initialize the KMS stuff in three regions, and then I was good to go. Easy as all get-out to get up and running.
The coolest thing about Biscuit is that the local file holding all the secrets my containers and repo'd code will eventually need is something I can commit to a repo without worries. I presume that I can set up a role for any machine that would run Biscuit and that the redundant KMS handles the rest. So far so good. Do check it out.
I noticed some folks reading through the Redshift category and noticed that I haven't written anything new for a while. So here's what's new:
We see that Redshift has improved its vacuum capabilities and added more functionality all around. Its performance has improved too, but its overall performance characteristics haven't changed. Redshift is not fundamentally different after two years. It still behaves like Redshift when compared to Vertica, the other MPP columnar database we support at Full360.
We have been able to learn quite a bit more about tuning Redshift. Full360 will be offering this service soon. It's called Upshift and it is surely the most comprehensive performance evaluation available in the industry.
I have to qualify all of this by saying that I personally work a lot more with Vertica than I do with Redshift. These two products may seem very similar but the details are often overwhelming. Fortunately we are developing a methodology that expresses the rules for optimization very well. So while the characterization I've made still holds true, there will be a growing number of exceptions and interesting circumstances. I call Vertica a magical sword. It is powerful, precise and it sharpens itself. I can cut intricate and delicate patterns, and chop hundreds of large heads. I call Redshift an ogre's club. It is massively powerful, brain dead simple to use and relatively inexpensive. So you basically have to look at your application and know whether or not it is a job for a club or a sword. Our methodology will tell you exactly which, but like I said, the devil is in the details and we are wrangling dozens of demons.
The good news is that both products are improving at a good clip. Still, I confess I'm paying more attention to Vertica. I'm really impressed with the overview I got yesterday on Vertica 8. They've optimized some of their geospatial algorithms. They've incorporated several ML features directly into the core product. They've dramatically improved their integration with Hadoop, Spark and Kafka. They're claiming 160% of Impala's performance. So that's superb. Most importantly, there is enthusiasm for the creation of the new company, which is less like Vertica getting sold to Micro Focus and more like a rebirth of Micro Focus itself, which is, by the way, the owner of SUSE Linux. The Vertica guys are thrilled that they'll be working for a software company. That means the upcoming integration with S3 is serious, as is their priority on cloud implementations. All good.
A couple weeks ago, something profound occurred to me about enterprise software. I realized that when it is priced by server or by processor, it's a ripoff; when it is priced by usage, it's a bargain. With per-server pricing you are paying for the potential of 8 cores even when you are not using them. Admittedly this is rather easy to see if you have experience working in the cloud and then take a turn doing on-premise practices, but it seemed rather profound to me. Imagine, if this is not obvious to you, paying a standard or premium license fee to your mobile phone carrier based upon whether you are using a brand new smartphone or an old one. Right. It makes no sense to pay for anything but the minutes you use on the phone. Mobile phone billing is done right: you pay for minutes. All software could be that way, so that you're not paying for servers, but for functions that spring to life, do your bidding, charge you a fractional penny, and then die.
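The arithmetic of the argument, with made-up round numbers rather than any real vendor's rates:

```python
# Compare paying for capacity (a licensed server that sits mostly idle)
# with paying per invocation. All prices here are invented for illustration.
server_monthly_cost = 200.00          # per-server license, used or not
invocations_per_month = 2_000_000
cost_per_invocation = 0.00002         # a fractional penny per function call

per_use_cost = invocations_per_month * cost_per_invocation
print(f"capacity pricing: ${server_monthly_cost:.2f}/month")
print(f"per-use pricing:  ${per_use_cost:.2f}/month")
```

At this made-up volume the per-use bill is a fraction of the capacity bill, and more to the point, it tracks actual usage: double the invocations, double the cost, and nothing owed for idle potential.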
This week my boss sent me a link to a guy named swardley who likes ducks. It turns out that this character is quite capable of blowing my tiny mind. He's done it twice already, and so now I have the burden of a speck of enlightenment. It is an enlightenment which is commensurate with my understanding of (Wladawsky-Berger 1999) n-tier computing and then (Vogels 2008) horizontal scaling. So since I've been doing cloud for six years, I now understand what's happening next. God save us all.
Tying software development to economics and cost accounting has long been the stuff of magic, SWAG and charlatans. At least that's what it seemed like for most of my career. But I think Simon Wardley has the solution. He has outlined an ecosystem and a framework for understanding how one can iterate (captive) algorithms towards mutual value for the developers and the customers. He calls it FinDev.
Like most useful thinking on the progressive edge of IT, one must assume AWS. That is to say, very little that exists outside AWS's ecosystem can be thought of as having great potential in the future of computing. AWS is beyond doing things very well; they are evolving at a monstrous pace and at enormous scale. As an aside, I asked some Agilists last night at El Torito why Amazon manages its businesses so well. One of them said it's because Amazon is a collection of relatively small businesses that work on a common billing system, and that's what keeps it simple to manage. They are not only eating, but profiting from, their own dogfood, which is the fully meticulous tracking of compute resource costing, and now with Lambda, down to the function. All the hardware is a sunk cost. Cloud computing is a utility. What matters now is billing by the function. Simply assume the cloud. It's already done.
So there is a scary aspect to this which is something we all should have been afraid of all along. It is what happens to craft when things become industrialized. Your personal touch matters less in a market defined by optimization, cost-cutting and efficiency. Nobody cares how you show off the horse; we're all driving cars now. Nobody cares about your budget system; we're all using SAP now. Nobody cares about your fat client; we're all using browsers now. What's coming is a COTS revolution in which your college professor optimized the Towers of Hanoi solver and now owns the moneyTicker on the algo. In a global library of cloud-interoperable functions, the scope of what you get to work on gets narrower and narrower. The good news is that we are 20 years away from lockdown. The V8 of compute engine economies has been invented. Say hello to the next 50 years. You Wankels don't stand a chance. When I was an undergrad, I used to think of software as the same thing as law. There are lots and lots of lawyers but only a few legislators. The assumption was that the best lawyers at some point got to legislate and the rest just interpreted and borrowed citations for the benefit of those who never read the law. I believe there will be some measure of stare decisis in the new FinDev ecosystem.
So the future belongs to engineers who really know their customer's needs. The economy of FinDev provides value to developers and customers only to the extent that something can be built (at scale in the cloud) that customers want to use. When you charge by the use, that's a different business model than anything we've seen. Chances are it will be disruptive because it will go after captive inefficiently spent money. But there's greenfield out there too. More hopefully, there are new places computing can and will go once we wean ourselves from the economics of capacity planning, system depreciation, outsourced consulting and all that. I think AWS will be capable enough to handle global innovation in this regard; they're certainly leading. Now is the time to work our way towards best practices, evolving towards the revolution.
In Martin Cruz Smith's Arkady Renko series, the protagonist, Renko informally adopts an orphan who is a chess genius. Playing at the genius level, the kid doesn't require a board or pieces. He, and those like him, can just recite moves. He has a virtual queen and doesn't even need hardware. If you're thinking about physically moving pieces, you're not playing chess.
At Full360, we have developed a best practice around our design of greenfield and re-engineered DW applications. The following is a high level guide to how we accomplish this in Vertica. Vertica optimization is something we have pursued with vigor at Full360. There are several different levels at which this can be pursued. Implicit is the modularization of the applications so that the major functions of our data management philosophy can be expressed discretely. But let's get to the $10 words, shall we?
Idempotency Both JDub and I could go on to absurd lengths about how important this is to the modularization of DW design. I will simply, with characteristic casualness, tell you that it makes all of our stuff idiot-proof, in that it makes our data provision dependencies kind of go away. The basic idea is that you make your input streams discrete in application units (which basically means the chunks at which the data to be consumed makes process and business sense). So when your input streams are discretely chunked, you can run your process over and over without concern about whether it has been done once, twice or never. You just run that independent data provision job and it creates the right-sized bucket of data.
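A minimal sketch of the idea, with a plain dict standing in for a partitioned warehouse table; the chunk key and rows are invented for illustration:

```python
# Each run handles one discrete chunk (here, one day of input), and loading
# means "replace the chunk", never "append to the table". Running a chunk
# once, twice, or again after a failure leaves the same bucket of data.
warehouse = {}  # chunk_key -> rows, standing in for a partitioned table

def ingest_chunk(chunk_key: str, rows: list) -> None:
    """Idempotent load: re-running with the same input changes nothing."""
    warehouse[chunk_key] = list(rows)  # overwrite, don't append

day1 = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
ingest_chunk("2017-06-01", day1)
ingest_chunk("2017-06-01", day1)      # accidental re-run: harmless
print(len(warehouse["2017-06-01"]))   # 2, not 4
```

The dependency headache disappears because no job needs to know whether any other job already ran; the chunk boundary carries all the state.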
Set Transformation VLDB folks are probably familiar with why you do ELT rather than ETL. The simple way of saying it is that database developers are more stingy and efficient with data than ETL developers. I developed a taste for hand-crafted 'ETL' back in the days when Informatica was a baby, and having my Unix biases, I always loved moving files around. At the time, my focus was on Essbase, which had no ETL hooks, even though Arbor could have purchased an ETL company cheap. Interestingly tangential: Wall Street has never been very long on ETL companies. Anyway, I expect that Informatica and Talend will not like to hear that their days are numbered, but then neither did Carleton and DataStage, and they used to rule the world. The bottom line is that moving data from table to table using SQL, rather than a GUI-driven third-party tool, is going to be, in certain databases, much faster and more human-readable. So we do set transformation, and even regex stuff, inside Vertica. One day we may even benchmark UDFs against external programs.
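Here is the set-transformation pattern sketched with Python's stdlib sqlite3 standing in for Vertica; the tables and the code-to-status rule are invented for illustration. The data never leaves the database: one INSERT ... SELECT reshapes the whole set, instead of a row-at-a-time pipeline in an external tool.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE src (id INTEGER, status_code TEXT, amt REAL)")
db.executemany("INSERT INTO src VALUES (?, ?, ?)",
               [(1, "A", 10.0), (2, "C", 20.0), (3, "A", 5.0)])

db.execute("CREATE TABLE clean (id INTEGER, status TEXT, amt REAL)")
# The whole transformation is a single set operation inside the database.
db.execute("""
    INSERT INTO clean
    SELECT id,
           CASE status_code WHEN 'A' THEN 'active' ELSE 'closed' END,
           amt
    FROM src
""")
for row in db.execute("SELECT * FROM clean ORDER BY id"):
    print(row)
```

The same statement is also self-documenting in a way a GUI mapping rarely is: the business rule lives in readable SQL next to the data it shapes.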
Denormalized Vertica, like Redshift, is not a transactional database. It is columnar, and it easily handles 600, 800, 1200-column tables. It was designed to. So there is no reason to do a lot of silly little joins on silly little tables to get juicy fat data. We make all of that part of the ingestion process, which gives us what we want. Think about it for a minute. Consider the volatility of lookup tables and dimensions as compared to the volatility of atomic facts and transactions, aggregated or otherwise. The facts will be bigger and more fluid. So why spend join energy on query exhibits over the long haul when you can easily have all the columns you want? You don't have to. There are no table scans from hell; that's what columnar solves. So we go big.
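The join-at-ingestion idea can be sketched the same way, again with stdlib sqlite3 as a stand-in and invented tables: resolve the lookup once when the data lands, so the fact table carries the denormalized columns and queries never join at all.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lu_region (region_id INTEGER, region_name TEXT)")
db.executemany("INSERT INTO lu_region VALUES (?, ?)", [(1, "West"), (2, "East")])
db.execute("CREATE TABLE raw_sales (sale_id INTEGER, region_id INTEGER, amt REAL)")
db.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)",
               [(10, 1, 99.0), (11, 2, 45.0)])

# Denormalize during ingestion: one join now, zero joins at query time.
db.execute("""
    CREATE TABLE fact_sales AS
    SELECT s.sale_id, s.amt, r.region_name
    FROM raw_sales s JOIN lu_region r USING (region_id)
""")
for row in db.execute("SELECT sale_id, amt, region_name FROM fact_sales ORDER BY sale_id"):
    print(row)
```

In a row store the wide result would be a liability; in a columnar store, queries touch only the columns they name, so the width is nearly free.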
Production I'm not going to talk about the guts of Production other than to mention it briefly here. Production is where some of the genius lives, and we have a bag of tricks that is ever-expanding as we deal with realtime, near-realtime, and other odd types of data sources. Yes, we lambda with streams and lakes, but we lambda smartly, again with whatever tech makes sense. Right now we're playing with Kinesis and Kafka and our own custom Actor models, which we're sure will evolve over time. We're also looking at how to use Redis and other superfast KV stores. So I suspect we will grow many efficient tentacles as we Produce standardized data for ingestion into our columnar DW apps. Nuff said.
We ingest data into source tables for each schema as they come to us. No matter how many fields, large or small, we take them in using a COPY from produced files. Whether in Vertica or Redshift, we standardize on UTF-8, vertical-bar delimiters, and a backslash escape. In some cases, if we've munged up variable-length stuff from our own custom regex routines or other JSON, AVRO or semi-structured effluvia, we will add a pre-step using Vertica Flex Tables. We are coming up with best practices there too.
These source files should also retain the original names of the fields of the produced data when possible. This assists in debugging with the original developers.
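The file standard above (UTF-8, pipe-delimited, backslash-escaped) can be produced straight from Python's csv module; the table and file names below are invented, and the COPY in the comment is indicative only, since exact options differ between Vertica and Redshift.

```python
import csv
import io

# Write one pipe-delimited, backslash-escaped, unquoted row.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="|", escapechar="\\",
                    quoting=csv.QUOTE_NONE, lineterminator="\n")
writer.writerow(["42", "ACME | SONS", "café"])  # embedded pipe gets escaped
print(buf.getvalue(), end="")                   # 42|ACME \| SONS|café

# The matching load is a plain COPY; exact syntax varies by engine, roughly:
#   COPY src_orders FROM '/data/orders.psv' DELIMITER '|' ESCAPE AS '\';
```

Standardizing the file format means the same produced files load into either warehouse without per-target munging.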
All of the data that is to be used in the application should then be mapped to a view. This is the staged data. Staged data should be of the increment of ingestion (discretely chunked in application consumption units). That is to say that your _SRC and _STG will carry the same number of records although they are likely to carry a different number of fields once the ingestion is done.
The clean tables should be materialized tables with human-readable fields that are conceptually discrete. This is generally accomplished through a direct insert from one or more _STG views. Clean tables can be defined with blank fields for further rule application. Clean tables that contain all history for the query space should carry the suffix _HIST_CLN; otherwise one should assume that a clean table is the same size as an ingestion increment. Clean tables should be optimized for the scope of the transactions that take place in their creation. When you are looking at the clean tables, you have all of the data you need from all of the sources, presented in the way you as a developer think about the data. There should be very little ambiguity at this point; it's your best look at the details before they are aggregated and rationalized.
Lookup tables are straightforward. They should be optimized for their transaction capacity with clean tables.
In a complex model, clean tables can be made into Intermediate Fact tables. The intermediate fact tables are materialized tables with all of the necessary dimensions that support measures that can be made across the full query space. These may or may not be exhibited directly, but should be useful in a partial analysis of the particular measures they contain. An intermediate fact table should be the place where window functions are applied. They should be the place where obscure field conditions are made into explicit attribute and status fields. It is important to know that an application may have dependencies of different measures that seem to be dimensionally equivalent, but actually aren't. So by using IFTs we unburden ourselves of the very idea of a single massive star or snowflake that might have holes. Also, we can capture all of the attributes of a set of measures completely without concerning ourselves with the weight of them in the final presentation layer.
So think about this. A clean look at the data will probably not have sensible status fields; it will have codes. There may be multiple ways to interpret a certain combination of fields. So whatever you need to support the full scope of consumable data, even if it means synthetically remastering transactions, you can do with IFTs beforehand. Building these is the real guts of the DW application, and it's where the fun of working with sharp analysts comes in, especially on data integration projects. I will apply all of the reasonable and semi-reasonable business rules here. This is where I have enough detail to point machine learning at the data, because at this point it is explicit and human-readable. It will reveal the more interesting cases and outliers. And here it should be very rich, beyond human comprehension, so yeah, maybe you don't go full wide with these tables; only add 4 of the 50 psychographic customer tags you have against their ID, because, security.
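One flavor of what goes into an IFT, sketched with stdlib sqlite3 (window functions need SQLite 3.25 or later, bundled with modern Pythons); the transaction codes and the running-total measure are invented for illustration. Two things happen at once: opaque codes become explicit status fields, and a window function materializes a measure across the query space.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cln_txn (cust INTEGER, ts INTEGER, code TEXT, amt REAL)")
db.executemany("INSERT INTO cln_txn VALUES (?, ?, ?, ?)",
               [(1, 1, "P", 10.0), (1, 2, "P", 15.0), (2, 1, "R", -10.0)])

# Build the intermediate fact table: explicit status plus a windowed measure.
db.execute("""
    CREATE TABLE ift_txn AS
    SELECT cust, ts, amt,
           CASE code WHEN 'P' THEN 'purchase' ELSE 'refund' END AS txn_status,
           SUM(amt) OVER (PARTITION BY cust ORDER BY ts) AS running_amt
    FROM cln_txn
""")
for row in db.execute(
        "SELECT cust, ts, txn_status, running_amt FROM ift_txn ORDER BY cust, ts"):
    print(row)
```

Because the windowed measure is materialized here, downstream exhibits never recompute it, and the dimensional quirks of this measure stay contained in this one IFT.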
Exhibit tables are materialized views in some cases, but generally views that are presentable to the end user. These should be indexed and optimized for query retrieval. Security access rules for user groups, etc. are applied only to exhibit tables. It should be assumed that no processes other than dbadmin will have access to any tables or views but the exhibit tables.
So there it is. This is rather the state of the art that I have internalized in five years of cloud based columnar data warehousing. Maybe I should write a book.
I just put together a quick casual video covering five questions about trends in markets and customers that we're seeing. It's nice to see that our experience is exactly dovetailing with theresearch put out by Gartner's latest Magic Quadrant
Here's a link to the webinar I did last week with VoltDB. Having Volt as part of our architecture has enabled us to think about a whole new class of applications. Right now, I would say that we're at the point where we're really ready to deal with massive IOT streams. It's just a matter of getting the right people together. These days I'm brainstorming this kind of stuff, excited as I am by looking at real-time events and figuring out what I need to drive those data dogies.
In this presentation I talk about three different apps that are part of our Fast Data portfolio. All of them are real customers using this technology in production as part of Full360's managed service offerings. We designed these apps and built them as well. All of them run securely and reliably in AWS VPCs, and customers love them. These are exemplary of our multi-tier DW framework that we call elasticBI. Why Panigale? Because this is all about making very fast decisions in real time. A second late is too late.