# Data Warehouse Modeling: Comparing the Inmon and Kimball Approaches

This article takes an in-depth look at two mainstream data warehouse modeling approaches: Bill Inmon's Enterprise Data Warehouse (EDW) and Ralph Kimball's dimensional modeling (Star Schema). The Inmon approach emphasizes building a highly normalized, enterprise-wide data model to eliminate data redundancy, followed by department-oriented data marts. The Kimball approach instead focuses on business processes, building data marts as star schemas to keep analytical queries flexible and easy to use. The article also mentions other modeling techniques such as Data Vault and One Big Table, and compares the two main approaches in terms of logical design, physical implementation, team collaboration, and performance optimization, as a reference for choosing the most suitable data warehouse architecture.

⭐ **Inmon's Enterprise Data Warehouse (EDW) approach**: This approach centers on building a highly normalized, enterprise-wide data model, aiming to minimize data redundancy and ensure data integration and consistency. On top of that foundation, analysis-oriented data marts are created for the needs of specific business departments, making the data easy to access and query.

⭐ **Kimball's dimensional modeling (Star Schema) approach**: This approach focuses on identifying and supporting key business processes and analytical requirements. It organizes data as a star schema, consisting of a central fact table surrounded by multiple dimension tables, which is well suited to OLAP (online analytical processing) queries and is easy to understand and use. (A minimal sketch of such a schema follows this summary.)

⭐ **Comparing and combining the two approaches**: The article compares the Inmon and Kimball approaches in terms of degree of normalization, data redundancy, design focus (enterprise-wide integration vs. business process support), implementation complexity, and ease of use. It also notes that the strengths of both can be combined, for example in the hybrid approach described in Data Warehouse Blueprints, which pairs Inmon's centralized integration with Kimball's dimensional data marts.

⭐ **Other modeling techniques**: Beyond Inmon and Kimball, the article briefly mentions other data warehouse architectures and modeling techniques such as Data Vault modeling and One Big Table (OBT), giving readers a broader view of different strategies for organizing and managing data.

⭐ **Core principles and practical application**: Whichever approach is chosen, data modeling needs to consider consistency (conformed dimensions), extensibility (adding new data without breaking existing reports), maintainability, and flexibility for user queries. Kimball's approach in particular relies on dimensional modeling to simplify analysis and to keep data consistent across business departments.
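
To make the star schema described above concrete, here is a minimal sketch in Python/pandas; all table and column names are hypothetical, invented for this note: a central fact table holding only keys and measures, two dimension tables, and an OLAP-style query that slices revenue by dimension attributes.

```python
import pandas as pd

# Hypothetical dimension tables: one row per customer and per calendar day.
dim_customer = pd.DataFrame({
    "customer_key": [1, 2],
    "customer_name": ["Acme", "Globex"],
    "region": ["EMEA", "AMER"],
})
dim_date = pd.DataFrame({
    "date_key": [20230101, 20230102],
    "month": ["2023-01", "2023-01"],
})

# Fact table: one row per sale, carrying only foreign keys and measures.
fact_sales = pd.DataFrame({
    "customer_key": [1, 1, 2],
    "date_key": [20230101, 20230102, 20230101],
    "revenue": [100.0, 250.0, 80.0],
})

# Typical analytical query: join the fact to its dimensions and aggregate.
report = (
    fact_sales
    .merge(dim_customer, on="customer_key")
    .merge(dim_date, on="date_key")
    .groupby(["region", "month"], as_index=False)["revenue"]
    .sum()
)
print(report)
```

The same pattern scales to any number of dimensions: the fact table stays narrow while the dimensions carry the descriptive attributes users filter and group by.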

- **Inmon**: Emphasizes a highly normalized structure. Begins directly with the Core, which remains normalized and closely aligned with the source system. Denormalization occurs in the Data Marts, adhering more to a Snowflake Schema.
- **Kimball**: Initiates with a staging area, and the Core is already denormalized; here, the data mart effectively becomes the data warehouse. Known as a bottom-up approach, it focuses on creating a Star Schema. Kimball is the creator of Dimensional Modeling. (A small sketch of the structural difference follows this list.)
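
To illustrate that structural difference, here is a small, hypothetical example (product data invented for this note): the normalized, snowflake-style form keeps the category in its own table referenced by key, while the Kimball-style star folds the same attributes into one wide dimension table.

```python
import pandas as pd

# Snowflake / normalized form: category lives in its own table and is
# referenced by key, staying close to the source-system structure.
category = pd.DataFrame({
    "category_id": [10, 20],
    "category_name": ["Bikes", "Helmets"],
})
product = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Road 150", "Trail 90", "Sport 100"],
    "category_id": [10, 10, 20],
})

# Star / denormalized form: one wide dimension table, redundant but
# simpler for analysts to query directly.
dim_product = (
    product.merge(category, on="category_id")
           .drop(columns="category_id")
)
print(dim_product)
```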

Although both methodologies have nuanced differences, they ultimately share many similarities. Personally, I lean towards Kimball’s approach. It allows for staging data and then normalizing from that point, avoiding the replication of the OLTP source system’s structure beyond the staging phase.

Drawing from the early Trivadis best-practice architecture (as detailed in Data Warehouse Blueprints: Business Intelligence in der Praxis, by Dani Schnider, Claus Jordan, Peter Welker, and Joachim Wehner), I’ve found that leveraging the strengths of both Inmon’s and Kimball’s methods can be highly effective.

# Core of Kimball according to People from the Field

The “core” of Kimball from Josh Andrews:

1. Users will need to query data in any number of different ways, but fully normalized data models are way too complex for non-developers to deal with. So we need a simpler way to model data for analytics.
2. We need to think about dimensions in a conformed way. If you are slicing a bunch of facts by, say, customer, it needs to be the same, common list of customers no matter how many different things you slice it by. Otherwise users across the business eventually start getting inconsistent results, and then everybody loses faith in the accuracy of the data in the warehouse. And no, denormalizing all of your dimensions down into every fact is not a good solution when you have 30 facts and 30 dimensions in a DW of respectable size. (See the sketch after this list.)
3. We need to model data in such a way that when we add new data it doesn’t break existing reports and analytics. This means making sure we think thoroughly about the grain of all things ahead of time. If we change the existing columns or the grain, especially the grain, it’s going to break a lot of existing stuff. Well, maybe you can change things if you work in a small shop, but in an Enterprise Data Warehouse with hundreds or thousands of users, getting everybody to change their existing analytics takes literally years.
4. But, following #3, we are also going to constantly find new data and reports. So we have to make sure that we have modeled in a way that allows us to add new data.
5. We need a consistent conceptual way to get people to think about the events we analyze (facts) versus the things we are analyzing them by, so that we aren’t reinventing the wheel every time we go to build data models. I can tell every time I see a DW or analytics solution built by somebody who doesn’t understand this and doesn’t follow some approach (whether Kimball or Inmon), because it’s invariably some crazy denormalization approach that is internally inconsistent and impossible to explain in an organized way. Or even worse, they’ve just imported all the fully normalized datasets from the source and point the reports right at the source tables.
6. Users will invariably want to use multiple different BI and query tools to access the same dataset. You can try to herd them in a certain direction (like Looker), but you’ll never get them all. So the DW should be built so that, to the best extent possible, users can still query with the tool of their choice. Which means not relying on BI tools, like Looker, to perform major further transformations of the data in the BI layer. Minor transformations, documentation, labels, etc.? Sure.
7. Database performance is much better now than it was when Kimball was written, but Kimball’s model of conformed dimensional design was not primarily concerned with database performance. I will admit it is helpful in managing performance, and he discusses performance and storage space, but that was a secondary benefit, and regardless of that the primary use cases for it are still relevant.
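
Point 2 is easiest to see with a tiny example. The sketch below (hypothetical tables, not from the original post) shares one conformed customer dimension across two fact tables, so slicing either fact "by customer" produces the same customer list and consistent results.

```python
import pandas as pd

# One conformed customer dimension, shared by every fact table.
dim_customer = pd.DataFrame({
    "customer_key": [1, 2, 3],
    "customer_name": ["Acme", "Globex", "Initech"],
})

# Two facts at different grains, both referencing the same customer_key.
fact_orders = pd.DataFrame({
    "customer_key": [1, 2, 2],
    "order_amount": [500.0, 120.0, 75.0],
})
fact_tickets = pd.DataFrame({
    "customer_key": [1, 1, 3],
    "tickets_opened": [2, 1, 4],
})

# Because both facts conform to the same dimension, "by customer" means
# the same thing in both reports, and the numbers stay consistent.
orders_by_customer = (fact_orders.merge(dim_customer, on="customer_key")
                                 .groupby("customer_name")["order_amount"].sum())
tickets_by_customer = (fact_tickets.merge(dim_customer, on="customer_key")
                                   .groupby("customer_name")["tickets_opened"].sum())
print(orders_by_customer)
print(tickets_by_customer)
```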

Follow-up from Justin Lane on Slack:

I find it useful to recognise the difference between logical and physical data modelling.  Developing a Kimball model can be very beneficial to help organise thinking/design activity around the concepts of dimensions and facts.  This is especially useful when you have a team of people working on the development/maintenance of your warehouse.  The logical model can help to establish a shared understanding and facilitate discussion. Separating dimensions from facts helps to achieve this.  The physical implementation can be optimised for the target platform(s) and you may end up with flat/wide/denormalised structures.  These are (should be) the physical implementation of your logical model.
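
A rough illustration of that logical/physical split, with made-up tables: the logical model stays dimensional (separate fact and dimension), while the physical table delivered to the target platform can be a flat, wide, pre-joined structure derived from it.

```python
import pandas as pd

# Logical model: fact and dimension kept as separate concepts.
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "product_name": ["Road 150", "Trail 90"],
    "category": ["Bikes", "Bikes"],
})
fact_sales = pd.DataFrame({
    "product_key": [1, 1, 2],
    "quantity": [3, 1, 5],
})

# Physical implementation: one flat/wide table optimised for the target
# platform, but still derived directly from the logical star model.
sales_wide = fact_sales.merge(dim_product, on="product_key")
print(sales_wide)
```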

And a follow-up from Kosti Hokkanen here:

See also: Classical Architecture of Data Warehouse

# Comparative Insights:

Key comparisons include:

- Enterprise Data Warehouse Modeling in the Bill Inmon style vs. Star Schema in the Ralph Kimball style
- Data Vault Modeling (a rough sketch follows below)
- One Big Table (OBT)

Source: originally youtu.be/IdCmMkQLvGA?
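
Data Vault is only named above, so here is a rough, hypothetical sketch of its three core table types (hubs for business keys, satellites for descriptive attributes, links for relationships); the hashing scheme and names are illustrative, not taken from the source. One Big Table, by contrast, corresponds to the flat, pre-joined shape sketched earlier.

```python
import hashlib
import pandas as pd

def hash_key(value: str) -> str:
    # Data-Vault-style surrogate key derived from the business key.
    return hashlib.md5(value.encode()).hexdigest()

# Hub: one row per business key (e.g. a customer number).
hub_customer = pd.DataFrame({"customer_bk": ["C-7", "C-9"]})
hub_customer["customer_hk"] = hub_customer["customer_bk"].map(hash_key)

# Satellite: descriptive, slowly changing attributes attached to the hub.
sat_customer = pd.DataFrame({
    "customer_hk": [hash_key("C-7"), hash_key("C-9")],
    "customer_name": ["Acme", "Globex"],
    "load_date": ["2023-03-01", "2023-03-01"],
})

# Link: relationships between hubs (order hub omitted for brevity).
link_customer_order = pd.DataFrame({
    "customer_hk": [hash_key("C-7")],
    "order_bk": ["1001"],
})

print(hub_customer.merge(sat_customer, on="customer_hk"))
```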

Highlight from RW, Difference Between Kimball and Inmon:

- **Kimball**: Introduced by Ralph Kimball, this approach prioritizes identifying business processes and the pertinent questions a Data Warehouse must address. Documentation and analysis of these elements precede data consolidation from various sources (data marts) into a staging area. The data is then transformed into an OLAP cube via ETL processes. (A minimal staging-to-star sketch follows this list.)
- **Inmon**: Bill Inmon’s method begins with a comprehensive corporate data model, pinpointing critical areas like customers, products, and vendors. This model informs the creation of a detailed logical model, which then leads to a physically normalized model with minimized data redundancy. To make this complex model more accessible for business applications, department-specific data marts are developed.
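
As a minimal sketch of the Kimball flow described above, assuming an invented order extract in a staging area: a surrogate-keyed dimension is built first, then the fact is loaded at a declared grain of one row per order.

```python
import pandas as pd

# Hypothetical staged extract from an OLTP source (Kimball-style staging area).
staging_orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "order_date": ["2023-03-01", "2023-03-01", "2023-03-02"],
    "customer_code": ["C-7", "C-9", "C-7"],
    "amount": [120.0, 75.0, 200.0],
})

# Build the dimension with a warehouse-owned surrogate key.
dim_customer = (staging_orders[["customer_code"]]
                .drop_duplicates()
                .reset_index(drop=True))
dim_customer["customer_key"] = dim_customer.index + 1

# Build the fact at a declared grain (one row per order), referencing the
# dimension only through its surrogate key.
fact_orders = (staging_orders
               .merge(dim_customer, on="customer_code")
               [["customer_key", "order_date", "amount"]])
print(fact_orders)
```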

# Another comparison

Chapter 2 of Adamson's Star Schema from


Origin:
References:
Snowflake vs. Star-Schema auf Core
Created 2023-03-23
