Kimball DW Lifecycle Training

Kimball Data Warehouse Lifecycle – Warren Thornthwaite

This is a summary of my notes from a session I took at the 2011 TDWI World Conference.  This sessions was a last minute change for me.  I originally planned on attending the Exam Prep course for the certification tests that I was taking during the conference, but after having a chance to study the exam guide and some other materials in the subjects I was weakest in, I decided that a day about the DW lifecycle would be far more valuable.  It turned out I was right!  I passed the exams at the level I wanted and I learned a heck of a lot from Warren Thornthwaite.

Since I haven’t read through the book, it was nice to get the full overview from one of the experts and be able to catch a lot of insight about how to be succesful in the field of data warehouse development.  One of the things that was mentioned early in the course was the focus on determining what would make a difference in the enterprise, delivering the highest impact product (if feasible), and then measuring if it made a difference.  The keynote already had me thinking about what questions should be asked for our next project to make sure we have measurable success criteria, but Warren’s comments gave a very simple approach to keep in mind.  If a postive difference is made you have succeeded, of course whether the project was worth the cost is a whole different question to keep in mind. 

One of the other things I found very helpful was the discussion on how to conduct interviews.  He seemed to be talking about two types of interviews: interviews to determine the highest priority enterprise needs in order to plan which project comes first and interviews to figure out what the requirements are for a project.  A few key things I picked up were:
  1) Interview individually or in very small groups so you can get the most honest feedback.
  2) Focus on what the person is doing now for their job, not necessarily what they want. 
       This usually helps ensure you understand the process you are helping the business 
       analyze and aren’t missing things that are currently needed.
  3) Do some type of formal documentation and interview write up for each interview.

All in all, the course was a great overview for me of concepts I have already studied and come across in my daily work.  Many specific things that were brought up I have not actually done the way I have been taught, so it was a great time to think through why certain things like a DW Bus Matrix and inteview documents are no where to be found within my team.  Some of the information on architecture design could be useful when I review them in the future, but these were my initial take aways from this course.

Posted in Data Warehouse Design, Training | Tagged , , | Leave a comment

TDWI Conference 2011 – San Diego

I was fortunate to attend the TDWI conference last week.  This was my first time at the conference and I was very impressed with how useful the sessions are.  Along with attending the sessions I also studied in the month leading up to the conference and was able pass the CBIP exams.  Below are my summarized thoughts on what I got out of the conference.

  • Monday Keynote – Ken Collier

This presentation was the best I’ve heard on what you should know about Agile Data Warehousing.  In all fairness, I still need to watch the keynote Ralph Hughes did later in the week since I couldn’t make it down that day.

Key takeaways for me:

      1. Agile Manifesto – focus is on responding to change over following a plan
      2. Agile teams are most effective when trusted to get things done without micro-management.  This was especially helpful to me since I am adjusting to a supervisor role within the development team.
      3. Advice for agile managers – help with containing scope and maintaining clarity on overall goals
      4. Primary measure of progress is working software
      5. Task completion is not the measure of success
      6. What matters in Agile:
            1. Delivery of software that adds value
            2. Quality of software
  • Sessions

I attended 3 days of sessions which were all very informative and much of the information can be translated directly into my day to day work, especially as we kick off the next project.  The sessions I attended are Kimball DW Lifecycle, Intermediate and Advanced Dimensional Modeling, and Requirements Management for Agile DW.  These were all exceptional classes taught by top experts in the field and I want to provide more information on the value I received out of them.  Therefore, I plan to write out my thoughts on each of these in separate posts.

  • CBIP Exams

The exams I took were very tough.  The exam guide that TDWI sells was very helpful in getting me in the right mindset for how broad the range of questions would be and what types of areas I was expected to know.  The IS Core and Data Warehouse exam that everyone is required to pass for the certification were challenging but valid tests of someones understanding of industry terminology and concepts.  Some questions definitely require answers that are the TDWI definition and are confusing to me based on my experience, but there is still value in knowing how others define certain key concepts.  That being said, by reading through some TDWI class notes and the exam guide, I was able to pick up the extra information I needed for the exams.  The third test I took was the Data Management specialty exam.  This was by far the most challenging and frustrating.  There were a lot of questions related to data governance, master data management, and many other very different parts of data management that are not closely tied to what a data warehouse developer does.  Thankfully I have been involved in more discussions about master data management and data governance at my current job, including some webinars that helped me learn a bit of the terminology.  I was able to pass the exams at the mastery level, so I can’t complain.  I do think the preparation helped me in gaining more knowledge on aspects of data warehousing that I am less familiar with.  I believe this will help me in communicating with other individuals within my company and also help when trying to hire people who come from different backgrounds.

Posted in Training | Tagged , , | Leave a comment

Data Quality In SSIS

I recently received a comment asking for more details on how we implemented data quality tracking using SSIS.  Since I’ve been meaning to post a new entry for a while, I figured I’d take this opportunity to try and give a brief explanation of how we did it.  As usual, some of this was built by other members of my team, but I’ll try to give some insight on what was needed to build in data quality.

SSIS Components

To build data quality into every SSIS package without rebuilding the logic every single time, we used a generic data quality package that we call after every table load.  Since not every table we load needs screens applied, we have configurations set up to in each package to set the execute package task property ”Disabled” to true or false.  All the data quality screens are SQL statements that we store in a database table, and we call the correct ones based on which table was just loaded. In the screenshot you can see two disabled (grey) control flow items.  The first one uses the source table name and destination table name to pull out the corresponding id that needs to be passed to the screens list.  In this SQL task we actually hard code the variable values that make up the table names (@database, @schema, @table), so these must be maintained for each separate table load sequence.  The “Data Quality Check” package that is called has a set of parent package variable configurations to accept the source and destination ids that were determined by the “Get Screen IDs”.  The Data Quality Check package then pulls back a list of screens that have the source and destination identified.  Once the list of screens is found, a for loop iterates through each one and calls a stored procedure that runs the SQL contained for the screen.  If the screen has an action to do when a quality event is found, then we pass on the action SQL logic (also stored in our screens table).

Screens List

For our quality system to work, we had to have a list stored that contains a quality check sql statement for each screen.  Some screens also contain an action sql statement for data cleanup, normally an update and sometimes a delete from the staging table.  As mentioned above, we also have information about the source and destination so we can identify which screens should be run after a table is loaded.  While the screens list ultimately resides in a table in our staging database, we keep the list updated by populating the table from a sharepoint list where we enter the data.  We also populate quality dimensions using data input to our sharepoint list.

While this still does not cover a ton of detail, hopefully it gives a little more insight into the actual implementation.  If anyone has questions on a specific piece of the quality system, please comment and I will try to elaborate.  The stored procedures used will vary based on how you manage your screens and what attributes you want to track for your quality screen executions and quality events.

Posted in Data Quality | 1 Comment

Introduction to Data Mining

I recently attended a SAS course called “Data Mining Techniques: Theory and Practice”.  This course was taught by Michael Berry of Data Miners, Inc.  Michael and his colleague Gordon wrote some well known books including the one I received a copy of as part of the training, Data Mining Techinques for Marketing, Sales, and Customer Relationship Management.  This was my first introduction to actual data mining theory and this course was a great place to start.  The rest of this post is an overview of some of the key takeaways that I expect to apply to my work, but certainly there is a lot more information than I can include here.  I have not put any of these techniques into action yet, so feel free to add some comments if I say something incorrectly.  Also, look forward to more posts about my real experiences in prepping data for data mining and hopefully even creating some models at least for my own practice.

The techniques discussed in the training were Regression, Decision Trees, Neural Networks, Memory-Based Reasoning, Clustering, Survival Analysis, Market-Basket Analysis, Link Analysis, and Genetic Algorithms.  One of the really interesting components of the training was discussing the difficulty in defining the right business question and then getting the correct transformed data set together.  I hope to post more in the future on some common transformations that happen in prepping data, but for now I will discuss the techniques I expect to be most common in my type of work: Regression and Decision Trees, along with some data exploration done by Clustering.

Regression models are one of the most common techniques and the concept is familiar to people who have worked with statistics and analytics.  Basically you are building a model which forms a best fit line for the initial data set used in development (called the “training” data).  Once this best fit line is created, you can predict future values by passing the known attributes (such as month and product type) into the linear equation and seeing at what point on the line the target value fits.  With regression models you want a small number of input variables and determining these can be done by choosing known key factors, adding variables one at a time based on which one performs best on it’s own, or using a decision tree to determine which variables are most predictive.  Regression models are really good if working with a continuous target variable, such as a dollar amount or percentage, rather than a binary result, such as ‘Returning Customer’ vs ‘One Time Purchaser’.  Some of the advantages of this method over others are that regression is usually the easiest model to understand, can usually be modeled without complex data mining software, and it finds global patterns based on only a few inputs.

Decision trees are rules set up in a tree structure, similar to a data flow diagram.  One major advantage of decision trees is you can visualize which attributes are used to determine the target value.  Decision trees can also be used for data exploration, even if the final model does not utilize a decision tree for predictive purposes.  This technique works very well if there are multiple paths to the same target value.  A key thing to remember when using decision trees is that once you determine the best model on your training data you need to run a separate validation data set against the model.  The validation set will let you find specific rules that were strong in your training data but not strong in the validation set.  Once you have run the model with your validation data, you want to prune the tree so that the final decision point is a good fit for both test and validation data, this way you aren’t building a predictive model that really only works for the initial data set.  Also be careful to normalize values that might change frequently so that your model will continue to work in the future (this concept applies to all techniques).  It’s also important to make sure each leaf of your tree is correct for a reasonable number of records — meaning in most cases don’t allow leaves to be created that only represent 5 records in your training data, since there is probably a less granular level that can give you better predictive results.  Decision trees are best when the target variable is a classification and they have the benefit of being able to work with a large number of input variables.

Another useful technique is clustering.  There are a variety of algorithms that can be used, but the concept is fairly simple.  Clustering is a form of undirected data mining where you are not specifying a target variable but simply looking for groups of records with similar characteristics.  It is a good way to take a large data set and break it down to relevant segments.  Sometimes the segments that are created are best suited as input variables into a different type of model (such as regression or decision tree models), but you can tell a lot just by looking at the means of numeric values.  So while clustering is a very useful technique, it doesn’t normally lead to any predictive model.  It will, however, tend to lead to a lot more questions about how the segments vary.

A couple other techniques that I recommend reading about further (and I plan to try out in the future) are survival analysis and market-basket analysis.  Survival analysis is a way to model time-to-event problems, usually customer focused questions such as how long it will take for a customer to leave or stop making purchases.  Market-basket analysis (or association rules) relate to what behaviors tend to occur together.  The basic concept is answering the retail question “Which items are purchased in the same transaction?”  A way association rules may be useful in my work is to help define data quality checks based on pairings that are highly unlikely.

Hopefully this is a good overview of what I think will be useful.  In the future I should be posting with a more detailed understanding after I have created a few practice models, most likely with Microsoft Data Mining tools.

Posted in Data Mining | Leave a comment

Data Quality – High Level Overview

I recently presented to our data warehouse stakeholders about our approach to building a data quality process into the ETL.  The approach is based on the suggestions in chapter 4 of The Data Warehouse ETL Toolkit, one of the great books by the Kimball Group.  The main technical advantage we used from the book is the concept of storing the detailed logic (SQL code) that is needed to run the quality checks, so that we can have each SSIS package automatically run all the screens that apply.  This means there is no change to ETL code when we want to add or remove a quality screen, which is a great advantage for our team because moving changes in ETL to production is not a 5 minute step.  However, this post is to share the key points of data quality and the process we are using, so I will save the technical details until a future post.

First, let’s talk through some definitions that help us understand the key components of our data quality process.

Data Quality degree of accuracy in reflecting reality
Quality Screen data check to determine if a value meets criteria for accuracy (business rules)
Quality Event failure to pass Quality Screen check
Data Steward person responsible for reviewing quality of a data set and responding to quality events

The data quality process is in place to ensure decisions are made based on data that is:
1. Correct – it accurately reflects reality  (not just reflects what someone typed in)
2. Clear - it is easy to understand and we don’t have two values that are hard to determine the difference between
3. Conformed – we don’t use two values that mean the exact same thing, a classic example is one table representing gender as “M” or “F” while another table uses “Male” or “Female”
4. Complete - there are no missing records and blank values are replaced by a value that can be filtered on

The strategy used to improve data quality includes:
1. Reporting results of quality screens
2. Identifying the severity of a quality concern
3. Cleaning values that would cause inaccurate reporting
4. Tracking overall trending of data quality over time

As displayed in the process flow image below, the data quality process runs data screens at a few steps in the data load process. First, we run screens immediately after extracting data.  This will record issues that exist in specific source system tables and we can report back so that the system administrators can fix any events that represent an error in the source system.  Next, we run screens after source data has been conformed and integrated with other source system tables.  This will record both data issues that existed in source system and issues that may not be source system errors but cannot be loaded as-is to the data warehouse.  As screens run they are recorded in the Data Quality Data Mart and we can then create reports from this data mart.  Normally if a screen reveals an error at the data source, the data steward will be made aware of the event and can notify appropriate people to correct the source data.  We strive to fix data issues as close to the source as possible, but we must change or reject any data that would cause an error in the data warehouse.

 

Posted in Data Quality | 2 Comments