Change Data Capture and SQL Server Analysis Services
I'm designing a database application where data is going to change over time. I want to persist historical data and allow my users to analyze it using SQL Server Analysis Services, but I'm struggling to come up with a database schema that allows this. I've come up with a handful of schemas that could track the changes (including relying on CDC) but then I can't figure out how to turn that schema into a working BISM within SSAS. I've also been able to create a schema that translates nicely in to a BISM but then it doesn't have the historical capabilities I'm looking for. Are there any established best practices for doing this sort of thing?
Here's an example of what I'm trying to do:
I have a fact table called Sales which contains monthly sales figures. I also have a regular dimension table called Customers which allows users to look at sales figures broken down by customer. There is a many-to-many relationship between customers and sales representatives so I can make a reference dimension called Responsibility that refers to the customer dimension and a Sales Representative reference dimension that refers to the Responsibility dimension. I now have the Sales facts linked to Sales Representatives by the chain of reference dimensions Sales -> Customer -> Responsibility -> Sales Representative which allows me to see sales figures broken down by sales rep. The problem is that the Sales facts aren't the only things that change over time. I also want to be able to maintain a history of which Sales Representative was Responsible for a Customer at the time of a particular Sales fact. I also want to know where the Sale Representative's office was located at the time of a particular sales fact, which may be different than his current location. I might also what to know the size of a customer's organization at the time of a particular Sales fact, also which might be different than it is currently. I have no idea how to model this in an BISM-friendly way.
You mentioned that you currently have a fact table which contains monthly sales figures. So one record per customer per month. So each record in this fact table is actually an aggregation of individual sales "transactions" that occurred during the month for the corresponding dimensions.
So in a given month, there could be 5 individual sales transactions for $10 each for customer 123...and each individual sales transaction could be handled by a different Sales Rep (A, B, C, D, E). In the fact table you describe there would be a single record for $50 for customer 123...but how do we model the SalesReps (A-B-C-D-E)?
Based on your goals...
- to be able to maintain a history of which Sales Representative was Responsible for a Customer at the time of a particular Sales fact
- to know where the Sale Representative's office was located at the time of a particular sales fact
- to know the size of a customer's organization at the time of a particular Sales fact
...I think it would be easier to model at a lower granularity...specifcally a sales-transaction fact table which has a grain of 1 record per sales transaction. Each sales transaction would have a single customer and single sales rep.
FactSales DateKey (date of the sale) CustomerKey (customer involved in the sale) SalesRepKey (sales rep involved in the sale) SalesAmount (amount of the sale)
Now for the historical change tracking...any dimension with attributes for which you want to track historical changes will need to be modeled as a "Slowly Changing Dimension" and will therefore require the use of "Surrogate Keys". So for example, in your customer dimension, Customer ID will not be the primary key...instead it will simply be the business key...and you will use an arbitrary integer as the primary key...this arbitrary key is referred to as a surrogate key.
Here's how I'd model the data for your dimensions...
DimCustomer CustomerKey (surrogate key, probably generated via IDENTITY function) CustomerID (business key, what you will find in your source systems) CustomerName Location (attribute we wish to track historically) -- the following columns are necessary to keep track of history BeginDate EndDate CurrentRecord DimSalesRep SalesRepKey (surrogate key) SalesRepID (business key) SalesRepName OfficeLocation (attribute we wish to track historically) -- the following columns are necessary to keep track of historical changes BeginDate EndDate CurrentRecord FactSales DateKey (this is your link to a date dimension) CustomerKey (this is your link to DimCustomer) SalesRepKey (this is your link to DimSalesRep) SalesAmount
What this does is allow you to have multiple records for the same customer. Ex. CustomerID 123 moves from NC to GA on 3/5/2012...
CustomerKey | CustomerID | CustomerName | Location | BeginDate | EndDate | CurrentRecord 1 | 123 | Ted Stevens | North Carolina | 01-01-1900 | 03-05-2012 | 0 2 | 123 | Ted Stevens | Georgia | 03-05-2012 | 01-01-2999 | 1
The same applies with SalesReps or any other dimension in which you want to track the historical changes for some of the attributes.
So when you slice the sales transaction fact table by CustomerID, CustomerName (or any other non-historicaly-tracked attribute) you should see a single record with the facts aggregated across all transactions for the customer. And if you instead decide to analyze the sales transactions by CustomerName and Location (the historically tracked attribute), you will see a separate record for each "version" of the customer location corresponding to the sales amount while the customer was in that location.
By the way, if you have some time and are interested in learning more, I highly recommend the Kimball bible "The Data Warehouse Toolkit"...which should provide a solid foundation on dimensional modeling scenarios.
The established best practices way of doing what you want is a dimensional model with slowly changing dimensions. Sales reps are frequently used to describe the usefulness of SCDs. For example, sales managers with bonuses tied to the performance of their teams don't want their totals to go down if a rep transfers to a new territory. SCDs are perfect for tracking this sort of thing (and the situations you describe) and allow you to see what things looked like at any point historically.
Spend some time on Ralph Kimball's website to get started. The first 3 articles I'd recommend you read are Slowly Changing Dimensions, Slowly Changing Dimensions Part 2, and The 10 Essential Rules of Dimensional Modeling.
Here are a few things to focus on in order to be successful:
- You are not designing a 3NF transactional database. Get comfortable with denormalization.
- Make sure you understand what grain means and explicitly define the grain of your database.
- Do not use natural keys as keys, and do not bake any intelligence into your surrogate keys (with the exception of your time keys).
- The goals of your application should be query speed and ease of understanding and navigation.
- Understand type 1 and type 2 slowly changing dimensions and know where to use them.
- Make sure you have a sponsor on the business side with the power to "break ties". You will find different people in the organization with different definitions of the same thing, and you need an enforcer with the power to make decisions. To see what I mean, ask 5 different people in your organization to define "customer" or "gross profit". You'll be lucky to get 2 people to define either the same way.
- Don't try to wing it. Read the The Data Warehouse Lifecycle Toolkit and embrace the ideas, even if they seem strange at first. They work.
- OLAP is powerful and can be life changing if implemented skillfully. It can be an absolute nightmare if it isn't.
- Have fun!