Data Architecture 101, Part 3:
posted by Mosaic Data Science
Data marts, data warehouses, and some operational datastores use dimension tables. A dimension table categorizes a fact table that joins to the dimension. At query time one filters the facts by values in the dimension table, and uses those values to label the query results. For example, four dimensions in Figure 2 of our second data-architecture post “Overview of Relational Architectures” categorize a sale line-item fact. One of the dimensions is location, which we defined so:
create table location_dim
Figure 1: Location Dimension Create Statement
So we could query the line-item fact table by location name like this, to get total sales for the Denver location (over all time):
sales_line_item_fact.location_dim_id = location_dim.id and
location_dim.name = ‘Denver’;
Figure 2: Sample Star-Schema Query
The example location dimension only has three attributes. Dimensions may have many attributes. Let’s enrich the location dimension so that it records the locations’ state, the type of retail location, the number of departments at the location, the location’s distance from the nearest major metropolitan center, the average population in the location’s primary market, and the average household income in the location’s primary market:
create table location_dim
Figure 3: Enriched Location Dimension Create Statement
There; that’s a much more interesting, thought-provoking dimension.
As an aside: SQL has two kinds of statements: data-definition language (DDL) and data-manipulation language (DML). Create statements are DDL; the CRUD operations (including insert, select, update, delete, and merge, which is an important statement for BI) are DML. You’ll often see the acronym DDL without explanation in relational-database contexts. Now you know.
An obvious question about the enriched dimension above is, aren’t four of the new attributes (those having numeric types) measures? Our answer is, no, those are attributes. Some BI authorities insist that dimension tables should only contain discrete-valued attributes. We disagree (and we think this is a great example of dogma overtaking good judgment in BI, which is all too common). The majority of dimension tables have small row counts, compared to the fact tables they categorize. For example, even a large retail chain might have a few thousand locations. If the location table is versioned, the table might contain a few tens of thousands of rows. That’s a small table in the relational-database world—in which case we can safely record the exact values of these four attributes, and run queries that match ranges of attribute values. If a dimension table has a much larger row count, we might want to add some columns discretizing the real-valued attributes, like this:
create table location_dim
Figure 4: DDL for Location Dimension with Discretized Attributes
Notice that we left the exact-value columns in the dimension as well. It’s not uncommon for dimension tables to have several dozen columns, in part because of this sort of denormalization. (Very wide tables can present performance problems, but BI schemas usually use large block sizes in anticipation of long rows, to make sure every row fits within a single block. We’ll have more to say about BI-schema optimization in subsequent posts.) This lets us query by bucket to look quickly for patterns, and then (if something looks promising) do a more precise analysis with the exact values.
Dimensions can be versioned in several ways. Dimensions are often identified as being of type n, where n is a non-negative integer, to indicate the versioning strategy that the dimension uses. The problem with this typing scheme is that strictly speaking, it applies to columns, not tables; and we’ll describe them this way, even though most BI folks speak of e.g. type-two dimension tables. The first four types are defined as in Table 1 below:
|0||Do not change the column’s value, and do not insert another row with the same natural key and a different value in the column. (The column’s values are set in stone.)||Natural primary key columns should never be updated. For example, a city dimension table should never update its city-name or state-name columns, because (presumably) these columns together are the natural primary key.|
|1||Always overwrite the old value with the new value.||You might overwrite a city dimension’s current_mayor_full_name column.|
|2||Insert a new row with a different surrogate primary key and the same natural key.||A sales application might categorize cities as tertiary, secondary, or primary markets; and the business might want to track sales transactions according to the market type of their cities. One way to do this would be to make the city dimension type two.|
|3||Include as many versions of the column as you want.||You might track a city’s original and current market types in original_market_type and current_market_type columns.|
Table 1: Dimension Versioning Types
There are other types of versioning schemes, but you probably won’t need them.[i] Mostly dimensions (the whole tables) are type one and type two, except that the natural-key columns are type zero.
There are many special-purpose dimension-table design patterns. Here are some of the more common:
Group dimension: group dimensions are really a kind of snowflaking that is generally considered unavoidable even in data warehouses that try to adhere strictly to the star-schema pattern. The fact table joins to a group dimension, which has one row (possibly versioned) for each group. The group table joins to a bridge table, which joins to an outrigger group-member dimension table. If a member can belong to several groups, the bridge table captures a many (groups) to many (members) relation. If a member can belong to at most one group, the relation is one (group) to many (members).
Junk dimension: a dimension table that contains every possible combination of values from several dimension entity types, each of which has a small set of possible values. For example, if you have five dimension entity types that are all Boolean flags, you could combine them into a single junk dimension containing five char columns, each with two possible values, and with 25 = 32 rows. Then a fact table would have foreign-key column referencing the junk dimension, instead of five foreign keys to Boolean dimensions (or five Boolean measure columns).
Degenerate dimension: a column in a fact table that functions as a dimension by categorizing the fact, but that is not a foreign key. Often degenerate dimensions are foreign keys referencing a parent transaction, when the fact table containing the degenerate dimension is a line item. Thus we could add a sales_fact_id column to our sales_line_item_fact table, to facilitate aggregating or grouping over sales transactions.
Role-playing dimension: a single fact table may reference the same dimension several times. For example, an accumulating snapshot fact table may have a date foreign key for each step in the business process represented by the snapshot. One can create a view on the date dimension for each date foreign-key column, so that queries on the fact table do not (directly) reference the date dimension several times.
Audit dimensions: ETL and other BI/DW processes should have audit trails. The most robust way to capture an ETL audit trail is in a set of star schemas designed for that purpose. Facts in such star schemas might record, for example, the occurrence of specific data-quality problems. The dimension tables in these star schemas would categorize audit-trail facts by e.g. date and time, data source (provenance, destination table and column, and data-quality problem type.
A common mistake in BI schema design is to represent hierarchy within a dimension using multiple tables (generally snowflaked). Consider the example of geography: cities roll up to states (or their foreign analogs), which roll up to countries. In theory you could represent these as three dimension tables, and a fact table could have a foreign key to each of them. A more elegant approach is to have a single city dimension table, with name columns for city, state, and country:
create table city_dim
Figure 5: A Dimension Table Containing a Hierarchy of Logical Dimensions
Now you can filter by whichever level of the geographical hierarchy you want, using a single table. Rollups on geography are also straightforward using this approach. (We’ll leave to the reader the exercise of writing a sample SQL statement doing a rollup.)
There are a variety of technical problems involved in modeling dimensions correctly, especially when the dimensions are large (have hundreds of thousands, or millions, of rows). The Kimball Group’s blog posts on dimension-table core concepts and advanced dimension-table concepts are great resources for learning the fine points.
[i] If you just have to know, see http://www.kimballgroup.com/2013/02/05/design-tip-152-slowly-changing-dimension-types-0-4-5-6-7/.