How We Zeroed on the Best Data Warehouse
The advancement of technology generally evokes a range of emotions in people from all walks of life. Some choose to see the cons, while others consider it a way to bring the world closer and solve major challenges.
But, as they say, any sufficiently advanced technology is indistinguishable from magic! Let us walk you through yet another post on the goodies that technology can offer.

A Quick Recap: piStats and Kong – The API Gateway
In the previous post we discussed the pros and cons of using the Kong API gateway and the reasons that made us opt for it. Kong, as we discussed, is the first touchpoint and milestone of a user's click-event journey – a journey that begins in the user's browser or mobile app and ends at the piStats Realtime dashboards.

Going Forward: Big Data Management
In this blog, we will discuss how we store and retain this enormous data over the long term to drive our historical dashboards with user segmentation. We will talk about our data warehouse and the various alternatives we tried before finally zeroing in on Google BigQuery. The data collected from users' clicks needs to be stored, over the long term, for historical analysis and for determining different user segments. It then feeds the analytics dashboards and the campaign management systems that send push notifications, emails, and so on. The data should be stored in such a way that it is easily accessible, without long waits for the desired results.

Getting Started: Why Do We Need a Data Warehouse?
Apart from long-term storage, a data warehouse also serves as an operational data store: it moves data away from the transactional storage, avoids load on the transactional database, and lets us run reports and analyses on historical data. In a nutshell, it acts as a repository of integrated data from one or more sources.
Simple, isn't it? Now, let's see how we integrated a data warehouse with piStats.

piStats and the Data Warehouse
piStats uses the data warehouse for two of its main features:
- for the historical dashboards
- for cohort analysis – used for reporting and push notifications.
The historical dashboard has three views based on granularity: Daily, Weekly, and Monthly. It also allows date selection, so a report can be generated for the desired range and granularity.
Users can also create cohorts through a user-friendly custom query engine to define the desired segments.
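To give a feel for what such a dashboard or cohort query looks like in practice, below is a minimal sketch using the BigQuery Python client. The project, dataset, table, and column names (my-project.pistats.click_events, user_id, event_ts) are hypothetical placeholders, not the actual piStats schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Daily active users over a user-selected date range (hypothetical schema).
query = """
    SELECT DATE(event_ts) AS day,
           COUNT(DISTINCT user_id) AS active_users
    FROM `my-project.pistats.click_events`
    WHERE DATE(event_ts) BETWEEN @start_date AND @end_date
    GROUP BY day
    ORDER BY day
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("start_date", "DATE", "2017-01-01"),
        bigquery.ScalarQueryParameter("end_date", "DATE", "2017-01-31"),
    ]
)

# Iterating the query job waits for completion and yields result rows.
for row in client.query(query, job_config=job_config):
    print(row.day, row.active_users)
```

A Weekly or Monthly view would simply truncate the timestamp at the corresponding granularity instead of by day.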
Another important feature of piStats, Retention analysis, also relies on the data warehouse: user retention is recomputed every 6 hours. And finally, the data warehouse also helps piStats analyse and fetch historical data for training Machine Learning models, building intelligence, and performing predictive analysis.

But, What's So Special?
We required a data warehouse that could fulfill all the requirements of piStats with zero compromise on performance. In other words, it should support the limitless possibilities of piStats with an eye on:
- Scaling and Integration: The data warehouse should be able to scale with the growing data set. For one of our clients, we added about 150 GB of data every month - we did not want to over-provision, nor did we want to rescale frequently.
- Time Optimization: We wanted query performance to be good enough that a user firing a query from the custom query engine is not kept waiting; ideally, results should come back in under a minute.
- Cost Optimization: We wanted a pricing model that charges for the storage and compute we actually use, not for the fixed, packaged storage and compute that one normally gets with a data warehouse.
- Data Storage and Maintenance: Auto-purging was also one of the deciding factors. We wanted to keep only the last 3 months of data rather than persisting the whole data set, and this came out-of-the-box in Google BigQuery (see the sketch after this list).
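As a sketch of that out-of-the-box purging, the snippet below creates a date-partitioned BigQuery table whose partitions expire automatically after roughly 90 days. The table name, schema, and the exact retention window are assumptions standing in for "the last 3 months of data", not the real piStats configuration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical click-events table, partitioned by event timestamp.
schema = [
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("event_name", "STRING"),
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
]

table = bigquery.Table("my-project.pistats.click_events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
    # Partitions older than ~3 months are dropped automatically.
    expiration_ms=90 * 24 * 60 * 60 * 1000,
)

client.create_table(table)
```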
Why Not Redshift?
- Storage and Compute Scaling: Redshift provides fixed storage and compute nodes that need to be scaled up or down as requirements change. Scaling and re-partitioning is a good 4-to-5-hour job, during which the cluster is available only for reads and writes are withheld.
- Pricing: Every node in the cluster is charged for even when it is idle, which matters a great deal in our case: queries are fired mostly at midnight and only sparsely over the rest of the day, yet the cluster has to stay up, and be billed, around the clock.
- Performance: Since our data set was about 800 GB and cost was a concern, we wanted to keep the number of nodes as small as possible. Even with partitioning in place, queries took 4 to 5 minutes to return results, which is far too long for a user waiting on the frontend. Increasing the number or size of nodes would have left storage unused and pushed the costs up.
- Data Deletion: Deleting specific rows is allowed in Redshift, but deleted rows are only marked for deletion; reclaiming the space and keeping query performance healthy requires periodic VACUUM operations, which adds to the maintenance overhead.
Why Not Impala on AWS EMR?
- Storage and Compute Scaling: Impala was run on AWS EMR and hence required a fixed set of memory and compute nodes that had to be grown as scaling requirements changed. If Impala was given nodes short on memory, it terminated the query with an Out of Memory error, making it unreliable; over-provisioned clusters, on the other hand, left resources unused and added to the cost. To avoid paying for the time the data warehouse was not in use, we usually terminated the cluster, so whenever a new cluster was provisioned a good 800 GB of data had to be reloaded, adding to the provisioning time.
- Pricing: With Impala as well, one is charged for every node added to the cluster, be it on AWS EMR or an independent Hadoop cluster.
- Version: Impala has no direct support on EMR, unlike other Hadoop ecosystem components such as Spark and Hive. It has to be installed using bootstrap actions, which install older versions of Impala with a reduced feature set.
- Performance: Impala also took a good 10 to 11 minutes to return results for our queries, and if under-provisioned it would run out of memory without executing the query at all; needless to say, that happened quite frequently in our case!
A Comparative Analysis
So, before we knew it, our attempt to find the best-fitting data warehouse for our requirements had us testing two other alternatives along with BigQuery. Here's a comparative chart summarizing the technical features of Redshift, Impala, and BigQuery:
| Metrics | Redshift | Impala (AWS EMR) | BigQuery |
| --- | --- | --- | --- |
| Cost | Charged per hour for each instance | Charged per hour for each instance | Charged only for storage and data queried |
| Provisioning | Chosen as per storage and memory requirement | Chosen as per storage and memory requirement | No notion of hardware, hence no provisioning |
| Scaling | Nodes can be added or removed | Nodes can be added or removed | Not required, as there is no hardware to manage |
| Reliability | Query executes till completion | Query might fail with an out-of-memory error | Query executes till completion |
| Query Performance | Limited by the number of CPUs and memory | Limited by memory | No such limitation; uses the maximum resources available to return results fast |
| Maintenance | Requires periodic VACUUM operations | None | None |
| Partitioning and Distribution (keys) | Important for query performance | Important for query performance | Important to limit the amount of data queried; query performance is fast even without partitioning |
| Streaming Data | Not possible | Requires the table to be refreshed | Allows streaming inserts using APIs |
| Versions | None | Limited to 2.2 with EMR | None |
| Bulk Upload | 90 GB in 4-5 hours | 90 GB in 1-2 hours with a large Core node | 90 GB in 4-5 hours |
| Join Queries | Slow | Limited by the versions available | Fast |
| Count Distinct | Slow | Allowed but extremely slow; used NDV instead | Fast |
| Timestamp Datatype | Supports different formats | Limited format support | Supports different formats |
| Deleting Specific Rows | Allowed | Not supported | Not supported; only partitions can expire |
| Update Operation | Allowed but slow | Not supported | Not supported |
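Since streaming inserts are one of the BigQuery features called out in the table above, here is a minimal sketch of what a streaming insert looks like with the BigQuery Python client. The table name and row fields are hypothetical placeholders, not the actual piStats pipeline.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical click event; the real piStats schema is not shown in this post.
rows = [
    {"user_id": "u-123", "event_name": "article_view", "event_ts": "2017-02-01T10:15:00Z"},
]

# Streaming insert: rows become queryable within seconds, with no load job needed.
errors = client.insert_rows_json("my-project.pistats.click_events", rows)
if errors:
    print("Streaming insert failed:", errors)
```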