What Does A “Conformed Dimension” Actually Do?March 12, 2023
Every so often the “star schema is dead! dimensional modelling is dead!” discussion rolls around again, and we end up in endless debates about the star schema, the alternatives, and I always end up not quite so satisfied with those alternatives (but that’s for another day). Some of those discussions can go off piste a little, because the star schema can be used in a data warehouse, a data mart, and a BI tool model, it has flexibility of usage that perhaps designs like data vault do not.
So sometimes we’re not comparing apples to apples (hey, star schemas load and present data quite nicely from data vaults…so they’re not always in competition). And let’s not forgot a while ago Bill Inmon would be talking about using the star schema to serve data from a 3rd normal form enterprise data warehouse (before all those clickbait lakehouse articles…).
Another Sunday, another thought rolling around my head about why I use dimensional modelling
So here I am again, another Sunday, another thought rolling around my head about why I use dimensional modelling, why I love working with the star schema design, and why I advocate using for data warehouse/data mart design. Well, yet again a piece of work I was involved in used one of the constructs of the dimensional model I appreciate the most. In this blog I’d like to dive into Conformed Dimensions, because even though I understand it in a certain way, I was curious about what others thought the definition of the term is.
Thanks to all that contributed to the discussion on Twitter, I’m genuinely blown away by all the responses, thoughts, and debate that is generated. Always constructive, always informative, and I get to see things from different perspectives. I won’t go too much into the responses here but they fell into 2 distinct areas:
- A dimension that can be shared across different facts (different business processes)
- A dimension that returns a distinct list of values that have been mapped together from source systems
Let’s go through these 2 definitions, please note that I’m not looking to copy/paste the definition from the works of Kimball and Corr. I fully recommend reading the definition of conformed dimensions in Kimball’s The Data Warehouse Toolkit and Lawrence Corrs/Jim Stagnitto’s Agile Data Warehouse Design because they will go into much further depth than I will in this blog (e.g. shrunken rollup conformed dimension with attribute subset…now where’s the wide-eyed emoji!).
Conforming Business Processes
What do we mean by conforming business processes? Well, it can be boiled down to being able to use the same dimension across multiple Facts. With a Fact it’s worth recapping that this is a specific business process that needs measuring, materialized as a Fact table. E.G. both Web Telemetry and Sales analysis are separate business processes. I’ll use these 2 business processes in the scenario below, suffice to say the scenario is an online retailer capturing both the sales events, and the website product browsing events. The business wishes to analyse both the sales events and the web telemetry events using the same product dimension.
We see this example below a lot when talking/reading about dimensional modelling and the star schema. We see a single Fact table in the centre of the star and associated Dimension tables connected by surrogate keys as the star points.
But what about multiple Facts connected to a single Dimension table? Well as long as the Dimension means the same thing across those Fact tables (and has the same surrogate keys connecting the fact tables), then we can say that Dimension is conformed. E.G. a product dimension connected to both the Web Telemetry Fact and the Sales Fact. In the following diagram we see a Power BI data model with a single product dimension connected to multiple fact tables.
What we can now do is create an analysis using both measures in the sales fact and the web telemetry fact. We can see total sales and total session counts across products, and also create calculations based on those 2 facts too. Conversion Rate % is a measures created using separate fact measures.
We can see the immediate usefulness of conforming the product dimension as it can be used across the different fact tables, as long as the same surrogate keys are used to link the fact tables to the dimension tables. But we have “under the hood” benefits too in terms of only needing to update the dimension attributes in a single place and re-use those across multiple facts. We also mitigate any definition and/or logic drift as we’ve got it all in one place (well, hopefully…depending on the ETL design).
Conforming Source Systems
We’ve talked about a conformed dimension meaning the same thing across multiple facts, but we can actually start conforming earlier in the ETL process. If we have multiple source systems containing the same information, like products or customers, the data may (well in most cases in my experience!) contain different values. E.G. the same product in multiple systems but with different source system codes, names, descriptions etc. So to even get as far as re-using a dimension across multiple facts, we first need to create a distinct list of values from these source systems and de-duplicate and match, so we create one set of dimension attributes. In the case of our product dimension, Product A in one source system could be called Product A-1 in another system. We need to deal with this mapping (or bash the original data entry process over the head) as we load and transform.
Kimball has a section in the Data Warehouse ETL Toolkit book (p155) which shows a database design pattern for creating a conforming module, a design to allow source system values to be mapped together to a single value. E.G. if there are multiple source systems with Products in, perhaps the same product is not named the same, or indeed have the same code, in different source systems. I first saw this in use by David Morrison (TSQLNinja) over a decade ago!
FYI there’s example SQL code in the appendix of this blog for a conforming sub-system I’ve used. Other processes and vendor software are available 🙂
Flow of a Conformed Dimension
Ultimately we could link the definitions together and get an overall flow of a conformed dimension. From the initial mapping of source system attributes together, to the loading of the dimension, then the relationships between the multiple Fact tables. We can also publish the conformed dimension downstream to separate data marts, always making sure that the surrogate keys and domain values in the dimension are the same and mean the same thing.
Terminology…what’s in a name?
What about the actual term “conformed dimension”, I’ll admit that I’ve never had a conversation with a business user and said “oh yes, you need to use the conformed dimensions to analyse those measures together.” I don’t even think I’ve had a conversation with a developer about “conformed” either (I’m sure I have though). So when talking to “the business” what term would you use? I prefer shared (and feedback from Twitter would suggest the same) because it states exactly what is it, a dimension that can be shared. Nice and simple.
In Kimball’s Data Warehouse Toolkit, the authors note that the term conformed dimension is also know as common, master, reference, and shared dimensions (3rd edition p130).
In terms of development and architecture, yes I would still use the term conformed and specify exactly why a dimension is conformed (mapped source system attributes and/or used with multiple Facts.
Look, just go read those dimensional modelling books above! Seriously, stop reading the ramblings in this blog and go read those books 🙂 I have/am/will continue to advocate the use of dimensional modelling when designing data warehouses and data marts, I love the overall process of working through what an organisation wants/needs to measure, and then how they will measure it. I don’t believe the dimensional modelling technique is constrained to technology, it’s a business process that manifests itself in technology.
And every so often I’ll pop up and write about why I value it highly.
Here’s some example SQL code to create a conformed sub-system.
--drop existing tables DROP TABLE IF EXISTS ConformedColumn; DROP TABLE IF EXISTS ColumnDimension DROP TABLE IF EXISTS TableDimension DROP TABLE IF EXISTS SourceSystem --create tables CREATE TABLE SourceSystem ( SourceKey INT PRIMARY KEY, SourceName VARCHAR(255) ); CREATE TABLE TableDimension ( TableKey INT PRIMARY KEY, TableName VARCHAR(255) ); CREATE TABLE ColumnDimension ( ColumnKey INT PRIMARY KEY, ColumnName VARCHAR(255), TableKey INT CONSTRAINT FK_TableKey FOREIGN KEY REFERENCES TableDimension(TableKey), SourceValue VARCHAR(255) ); CREATE TABLE ConformedColumn ( SourceKey INT CONSTRAINT FK_SourceKey FOREIGN KEY REFERENCES SourceSystem(SourceKey), ColumnKey INT CONSTRAINT FK_ColumnKey FOREIGN KEY REFERENCES ColumnDimension(ColumnKey), ConformedValue VARCHAR(255) ); --Insert test values INSERT INTO TableDimension values (1,'DimProduct'); INSERT INTO SourceSystem values (1,'CRM System'); INSERT INTO SourceSystem values (2,'Retail Sales System'); INSERT INTO ColumnDimension values (1,'ProductName',1,'Product A-1') INSERT INTO ColumnDimension values (2,'ProductName',1,'Product B-1') INSERT INTO ColumnDimension values (3,'ProductName',1,'Product AA-1') INSERT INTO ColumnDimension values (4,'ProductName',1,'Product-BB-1') INSERT INTO ConformedColumn values (1,1,'Product A') INSERT INTO ConformedColumn values (1,2,'Product B') INSERT INTO ConformedColumn values (2,3,'Product A') INSERT INTO ConformedColumn values (2,4,'Product B') --generate mapping SELECT S.SourceName, T.TableName, C.ColumnName, C.SourceValue, CF.ConformedValue FROM ConformedColumn CF INNER JOIN SourceSystem S on S.SourceKey = CF.SourceKey INNER JOIN ColumnDimension C on C.ColumnKey = CF.ColumnKey INNER JOIN TableDimension T on T.TableKey = C.TableKey;