Brutal DDoS Attack – Is it the Beginning of a Bleak Future?
2 February 2017How to automate copying an EC2 AMI from one region to another
31 May 2017 Published by
BluePi
Data-Driven Business Transformation
Guide to Redshift Remodeling- Friends and Foes
Schema Design
Proper dimensional model is an absolute need for Redshift to perform well. Three different dimensional models work best with Redshift:
- The Snowflake
- The Star
- Flat : To understand the differences between the three modeling schemes let’s get some terminology out of the way. There are two types of tables that we create in a dimensional model Facts and Dimensions. The fact tables are the measures, while the dimension tables contain descriptive attributes. So, the facts usually have numbers and we use some numeric operations like sum, count, average on them. On the contrary, the dimensions are used to filter or categorize the data. The number of unique, in a dimension, would always be an order of magnitude smaller in comparison to the number of facts. The difference between Star and Snowflake Schema manifests in the ‘normalization’, we do in the dimensions. In the Snowflake schema, the dimension tables are d iteratively whereas in Star there is only one level of normalization. Examples So, let’s visualize this with two tables Sales fact and Store dimension table. The Sales fact stores the amount of product sold from each store. Star Schema
Here store and product are dimension tables while sales are the fact table. Data in dimension tables do not change frequently and as mentioned above, contains the descriptive attributes. Here, the store table stores the outlet information while the product is the collection of all types of products produced by the company. Fact table sales, rather stores every transaction or sale of a product made by the individual store, so it is highly dynamic in nature and usually the heaviest one as well. Note all the columns in sales are of integer type, which makes it easy to do aggregations, indexing for queries to run faster. Also, fact tables must be distributed across different redshift clusters by either stores or products to gain huge improvements in query performance. Snowflake Schema
Do you see a problem with Star Schema? There is one, dimension table “store” is not normalized. That means if there are 1000 stores in India, country ‘India’ would be repeated 1000 times. Same is the case with city and state. This makes dimension tables unnecessary bulkier. While denormalizing extensively increases the number of joins to fetch the data, which adds to computing time and adds to query execution. Flat Schema There exists another model i.e. flat model, which is essentially another level of denormalization over the star schema. The state, city, country and store all get folded into the fact table. Here are the pros and cons of the three modeling choices.
The three approaches compared
Type | Normalization | Ease of Query | Storage size | Joins |
Snowflake | High | Difficult | Low | Many |
Star | Medium | Medium | Medium | One |
Flat | Low | Easy | High | None |