Data Warehouse Modeling: Is Kimball Still Relevant?
In this article, I highlight key concepts from Kimball’s Data Warehouse Toolkit and compare them with my experience in data warehouse modeling on modern data stacks.
Which Theory to Use?
Data warehouse models require thoughtful design planning if they’re going to be efficient, scalable, and user-friendly. As a member of a full-stack data team in a scaling startup, I need to understand and apply modeling theories to the warehouses I build in dbt. But which modeling theories?
In academia, Kimball is presented as the main (if not the only) resource for warehouse design. On the field though, my experience has been different. While some companies still swear by Kimball (i.e. Shopify hands out a copy of the Data Warehouse Toolkit to all new recruits), many arguments are made against it, and other theories and best practices co-exist (read them here, here, and here), creating gray zones and ample room for interpretation.
Kimball In My Data Practice
After studying Kimball in school, practicing warehouse modeling for two years on a modern data infrastructure (Fivetran, dbt, BigQuery, Looker), and reviewing Kimball again more recently, I decided to take a step back and reflect on my interpretation and adoption of Kimball.
Concepts I’m Keeping
Despite having been first published in 1996, many concepts from the Kimball theory stand the test of time. I’ve identified three that stood out to me as key ideas that analytics engineers still benefit from using on modern data stacks.
1. The Publisher Metaphor
Kimball suggests that data warehouse architects compare themselves to publishers: just like publishers optimize publications for readability, architects should optimize for data consumption. In my opinion, this metaphor is absolutely in line with the “data as a product” philosophy, as recently laid out by Emilie Schario and Taylor A. Murphy, Ph.D. It motivates data teams to keep the end-user in mind throughout the building cycle.
2. The Enterprise Data Warehouse Bus Matrix
Kimball’s Data Bus Matrix is a simple table that summarizes the warehouse design plan at a high level. It lists all of the facts and dimensions that make up the business processes and maps the intersections between them.
In an agile environment, blueprints for future projects often feel like paperwork that will either get lost or become obsolete by the time it’s needed. To avoid this pitfall, architects should focus on documenting only what is necessary and do so in a very user-friendly communication style.
What I love about Kimball’s framework are its simplicity and completeness. It is easy for users to understand and maintain, yet it contains key design instructions that, if followed, even over multiple building sprints, will keep the warehouse in good shape.
3. The Kitchen Metaphor
In Kimball’s kitchen metaphor, warehouse architects can be compared to chefs in a restaurant kitchen. Chefs are responsible to select the best ingredients and are entitled to refusing low-quality ingredients. This concept translates well to SLA-backed data sources (for example Fivetran’s SLA). When source data doesn’t comply with requirements, data teams should be entitled to raise a flag and get the source fixed without having to negotiate with data suppliers.
Concepts I’m Skeptical of
Some of Kimball’s concepts apply to environments and architectures that I haven’t encountered yet. Here I note ideas that seem obsolete, but it might just be that I don’t have the experience to fully grasp their applicability. I would love to hear from other analytics engineers on their experience with those.
1. Everything Is Fact or Dimension
From Kimball’s perspective, every table in the data warehouse should be either a fact or a dimension table. I think that makes sense, and warehouses that follow this model probably scale very well. However, over the course of the last two years, I worked with less standard structures, and haven’t encountered major issues.
One of the factors that weigh in on the necessity to use a Kimball structure is the size of the company and the profile of the data users. For companies with less than 50 employees, I think that there is no screaming need for a Kimball-modelled warehouse yet. That being said, the benefit of drafting a data warehouse bus matrix and starting to gradually build facts and dimensions accordingly is probably worth the small investment.
2. Siloed Development and Data Teams
Kimball assumes low collaboration between development teams and data teams. This assumption has benefits, such as keeping data teams accountable for warehouse contents. However, it also puts a heavy burden on them to build models that are fully independent of their source data. This might not be optimal, especially for smaller companies that produce their own application data.
Source-independent warehouse surrogate keys are a good example of this costly assumption. Development teams build their own set of keys, and then analytics engineers re-build them in the warehouse, duplicating the company’s investment in the dataset.
In my experience, data and development teams can be highly collaborative. For example, at Breathe Life we had an ongoing Data Engineering Guild that brought the two teams together to discuss and address data projects collaboratively.
3. Non-SQL Savvy Users
Another Kimball assumption is that data warehouse users are not SQL savvy. This is the cause for certain Kimball labelling standards. For example, the theory suggests that value labels be explicit (i.e. boolean columns should presumably contain more explicit labels than the intuitive “true” or “false”). I personally think that this makes writing “where” clauses more cumbersome, while adding little value to warehouse users.
In my experience, warehouse users have been pretty SQL savvy. Most of the time they were mainly the members of the data team, so SQL knowledge is implied. If non-SQL savvy employees wanted to access data, they either did so by asking a question in a Slack channel or via a self-serve environment in Looker, where users did not have to use any SQL language to do their queries.
Keeping Some and Leaving Some
Reviewing Kimball for a second time definitely cemented key concepts that are required in my role. While I don’t think they should be applied systematically in all situations, they remain great resources for analytics engineers. Next time I run into problems such as duplicated computations or code-heavy BI layers, I’ll safely turn to Kimball to optimize our data models in the warehouse.