Data Warehousing Has Evolved
Data warehousing is a decades old industry. Volumes of money invested solving similar problems across varied industries. All for the promises provided by data warehouses: New insights. Innovative discoveries. Newfound profit centers. …the list goes on.
To be sure, there have been both grand successes as well as grand failures in this space since its inception. Many of us have witnessed or participated in them.
After such an extended period (measured in tech years), why are data warehouse efforts not enjoying a consistently high rate of success? There are decades of thought leadership and education in the space. Why has that not produced a larger pool of expert practitioners operating within a consensus on how to produce consistent results? A consensus that aspiring practitioners can refer to and replicate those successes?
To be fair, many reference architectures exist:
- Corporate Information Factory (CIF)
- Dimensional Modeling (Star Schema)
- Data Warehouse 2.0
- Data Vault 2.0
For someone new to the industry, how would they wade through all these options and choose the best approach? How do they read the voluminous varied vitriol and come out with a clear idea on how to proceed?
What exactly is the issue? Perspective.
“A little perspective. That’s it. I’d like some fresh, clear, well-seasoned perspective. Can you suggest a good wine to go with that?”
― Anton Ego, Ratatouille
Many reading this are technologists, and I can quickly agree that part of me is seated in this camp. A technologist’s mental energy, too frequently, is spent mulling over technological choices. Which storage sub-system, cloud provider, database platform, whether to buy vs build, etc. Believe me, I enjoy discussing row versus columnar storage; SAN vendors, LUNs and RAID; Snowflake, Redshift, Azure DW and the dozens of other database vendors and that you’re upset I didn’t mention your favorite.
It’s not that these choices aren’t important. They are. However, they shouldn’t be the primary focus of our mental energy when building a data warehouse. Well, at least, a successful one.
To provide some perspective, I’ll take a short walk down memory lane.
My first two decades of life, exposure to technology was limited. My dad brought home an Osbourne 1…in the early 1990s that served as a curiosity piece. I did build my college student profile web page with the requisite “Under Construction” and traffic cone images on it.
My interest was business. A member of DECA, I ran the student store in high school. In college, I led the Young Entrepreneurs Club. In fact, I didn’t own a computer until after college, marriage, and my first child. It wasn’t until I took a course on databases, that my career trajectory took a sharp turn into technology. I spent many years creating technological solutions, automating cell phone activations with client-server architecture, creating web-based CRM software, tuning SQL query response times from 22 ms to 17 ms. What glorious fun!
I was focused on building the fastest, most scalable and feature rich software applications. I gained mountains of knowledge and expertise in a variety of technologies (kids call it “full stack” these days).
Paradigm / Perspective Shift
Intro Data Warehousing. My first exposure to data warehousing was not a willing one. I was tasked with producing better data from a CRM application we created; three call centers’ customer interaction data.
While researching data warehousing, my business sense flooded back with this realization: the information produced and recorded by the application is where value comes from. Yes, an application may enable an activity or process, but the data from those activities has even greater value to the business. That information informs the business whether their past and current activities produce the desired results. Ideally it informs business leadership whether they need to change course.
Let’s view this whole process from business perspective. Business leaders need to know the answers to various questions around the operation of their business activities. It is precisely the business view of these questions that determine whether the data received provides relevant answers.
Business leaders do not get valuable data because a technologist chose one platform over another. Ultimately, questions are answered when a technologist and business leader stand in the gap to ensure what the business needs to know remains the focus. Whether filled by one or multiple people working together, they drive discovering and defining data from the business perspective.
Bringing It Back
Why have existing reference architectures, such as CIF, Star Schema, Data Warehouse 2.0 been ineffective in many instances?
A key role is missing. The person standing in the gap.
Existing methodologies come from one of either perspective, the technologist, or the business, but not both.
Kicking The Can Down The Road
I refer to traditional data warehouse approaches as kicking the can down the road. These efforts take the technologist-first approach. Driving design and architecture based upon how applications decided to organize data. The technologist has a job: get data from source systems and deliver it to the data warehouse. They are checking boxes on a list of requirements to get the job done. What we’ve found, is this approach may store historical data and all in the same location, however, it produces something akin to a junkyard.
Have you had the privilege of hunting for car parts in a wrecking yard? It very well mimics the experience with traditional data warehousing:
You approach the service rep and explain what you are looking for. Through hand gestures and vague verbal guidance, a few vehicles are suggested, amongst the few acres of wrecked automobiles. You’re on your own to retrieve the part and make sure it works. Finding the first vehicle, you discover the part is missing or damaged. Off to the next vehicle. You find the part but must extract and clean it up before using it.
The traditional data warehouse approach collects data but kicks the “analysis” can down the road, leaving data analysts and report developers to figure out how to make sense of the data.
This can cause disastrously disparate definitions for the same metrics. Governance becomes difficult, at best. In one organization, I discovered four definitions for Revenue, with wildly different values.
Putting Lipstick On A Pig
Dimensional Modeling was supposed to solve issues with the technologist’s approach. This approach attempts to dress up source data and make it pretty for the business. As a plus, Facts and Dimensions are easier for business minded consumers to understand. Dimensional modeling seemingly provides a better business solution, however, quickly becomes unwieldy and inflexible to change. Scalability issues combined with technical issues when solving various data scenarios. Business leaders have attended too many meetings about Type I, II, III and VI dimensions.
A major difficulty is dealing with changes to business rules. Once design decisions are implemented and data manipulated by business rules it is increasingly difficult to adjust existing data when those rules change. Reversing modifications made to data is often untenable. The concrete sets very quickly in a dimensional model.
Putting lipstick on a pig has questionable value. Adapting at the speed of business is not well tolerated. Applying lipstick to the pig a second time, after it knows what you’re up to, is dramatically more difficult!
Too Many Choices
With existing methodologies, a lot of technical decisions must be made for each data point. How to store it. How to combine it. Whether to maintain history. History of what?
With too many choices, inevitably the progress of the team becomes slower. The explosion of data and data sources exacerbates the issue in data warehousing.
Whether to get around limitations inherent in the methodologies or due to poor choices made while implementing a methodology, data consumers often conduct painful data acrobatics to get to the answers they need.
Do we have a solution? Yes.
Data Vault 2.0
“Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius — and a lot of courage to move in the opposite direction.”
― E.F. Schumacher
“If you can’t explain it to a six-year-old, you don’t understand it yourself.”
― Albert Einstein
Data Vault 2.0: the dream of the technologist and business leader working together, achieving their individual and shared goals.
Agreed, that is a high promise for sure.
Having built successful data warehouses using most approaches, I attest the success is predicated on maintaining dogged focus on the needs of the business.
I have experience with Data Vault for over a decade and understand the advantages inherent in its design patterns. With only four articles for reference that Dan Linstedt, the creator of Data Vault, published on TDAN.com, my team quickly benefitted from the advantages it provided.
With Data Vault 2.0, Dan has extended Data Vault to a “System of Business Intelligence” including best practices not only in design, but the entire scope of Enterprise Data Warehouse build-out, delivery and operation. Data Vault 2.0 has taken the concept and promise of Enterprise Data Warehousing to the next level and standardized it. Frankly, to a level most organizations have not experienced.
Clear, concise definitions are important to any discussion.
Data Vault 2.0:
A System of Business Intelligence containing the necessary components to accomplish enterprise vision in Data Warehousing and Information Delivery
The pillars of Data Vault 2.0 are Methodology, Architecture and Model. Understandably, information around Data Vault still tends to focus on the Model pillar. Its scope is much grander, however, and within each pillar People, Process and Technology are addressed.
Why Data Vault 2.0 (DV2.0)? In short, there is beauty in its simplicity. Scalability inherent in its design. Published standards to follow. Easy to understand. Ripe for automation.
I will focus on two aspects I find extremely compelling.
Side note: I will refer to the Raw Vault as Data Vault for simplicity’s sake.
To separate Hard versus Soft business rules, we must first define them.
Hard Business Rules
DV2.0: Any rule that does not change content of individual fields or grain
This leaves some room for interpreting how rigidly you adhere to it. Some examples of hard business rules:
- Data type alignment
- Normalization / Denormalization
- Tagging (system fields: e.g. Load Date, Source System)
- Splitting by Record Structure
Soft Business Rules
DV2.0: Any rule that changes or interprets data, or changes grain of the data
Essentially, any rule that goes beyond a Hard Business Rule.
Examples of a soft business rule:
- Concatenating fields (e.g. First + Last Name)
- Standardizing addresses
- Aggregations / Computations
Why the Separation?
Per the DV2.0 standard, only hard business rules may be applied prior to the Data Vault.
This ensures data auditability is maintained. A Data Vault can be used as the single-source-of-truth because data has not been manipulated beyond hard, enterprise business rules.
This guarantees historical data in the Data Vault, is representative of the source system at that time.
Soft business rules are reserved for use downstream from the Data Vault.
Data Vault 2.0 only specifies Hard versus Soft business rules. I recommend organizations view Soft rules as a gradual softening as data flows downstream. Moving from enterprise soft rules (calculating revenue, data quality rules, etc.) towards ad hoc rules data analysts/scientists might use (testing theoretical calculations, relationships, etc.)
I like to think of it like a gradient of hard to soft boiled eggs:
The beautiful thing is, when a soft business rule changes, and they do change, data can be retrieved historically from the Data Vault and re-populated using the new rules.
This allows downstream models to absorb changing business rules. It also enables the business to non-destructively test proposed business rule changes.
I worked for an organization that implemented a sales structure change. They redrew sales territories and how revenue rolled up. Without a means to test how the changes would affect public reporting, they implemented the change.
The changes made the publicly reported numbers look terrible. Within three months the changes had to be rolled back.
With Hard and Soft business rules properly segmented, proposed changes to Soft rules can be implemented in a temporary data pipeline. Historical data from the Data Vault can be fed through the rules to temporary dimensional or other models.
Data scientists and business analysts can inspect the data. If the results look good, the business can proceed. If not, delete the temporary pipeline and tweak or junk the proposed rules.
Ripe for Automation
Data warehouse automation has been the best kept secret in data warehousing and is finally getting more attention. The problem is existing methodologies have rough guidelines and many different decisions that can be made for similar objects. Depending on the architects, modelers and engineers behind the keyboard, the choices may be conflicting or simply changing with employee turnover.
Data Vault 2.0 has published standards. These standards define which patterns to use in which scenarios and allows data warehouse automation to jump to the next level as well. After decisions are made around business keys, relationships and attribution, automation can take over and build all of the objects and ELT code to properly move and store data properly.
Many teams have been speechless after participating with me in sourcing, modeling, deploying, and loading their source data into a Data Vault 2.0 warehouse in less than 3 days. Deploying on Snowflake and processing 100s of millions of records at the same time is a game changer.
Wrap it up
Data warehousing has suffered with methodologies battling from conflicting perspectives. Data Vault 2.0 bridges the gap by prescribing guiding principles from a unified perspective and patterns proven out through 30,000+ test cases.
Be aware! The bar has been raised in data warehousing through Data Vault 2.0 and automation.