All about Redshift
Understanding the Redshift Architecture: A Redshift cluster is made up of one leader node and one or more compute nodes.
1. Leader Node:
- The leader node interacts with client applications through protocols such as JDBC and ODBC
- It coordinates the work of the compute nodes
- It also prepares an optimized query execution plan and sends it to the compute nodes
2. Compute Node:
- Each compute node has its own CPU, memory, and disk, and you can configure how powerful the cluster should be
- We can either scale up, i.e. increase the CPU, memory, and disk of each compute node and run fewer but more powerful nodes, or scale out, i.e. run a larger number of medium-powered nodes
3. Node Slices
- Each compute node is logically divided into slices
- A slice has its own memory and disk space and all slices in a compute node work in parallel to handle the workload
- A cluster with N slices can process N partitions of a table in parallel
- So, if a table has a distribution key set on some column, its partitions are sent to these node slices for parallel processing
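As a quick way to see how many slices your cluster actually has (the number varies with node type and node count), you can query the STV_SLICES system table; this is a sketch, and the exact output depends on your cluster configuration:

```sql
-- Count the slices on each compute node; STV_SLICES has one row per slice.
SELECT node, COUNT(*) AS slices_per_node
FROM stv_slices
GROUP BY node
ORDER BY node;
```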
Redshift is a relational database management system (RDBMS) based on PostgreSQL, so it is compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data, Redshift is optimized for high-performance analysis and reporting of very large datasets. It integrates with various data loading and ETL (extract, transform, and load) tools as well as business intelligence (BI) reporting, data mining, and analytics tools.
Optimizing Redshift: There are two main ways to optimize Redshift query performance: applying a distribution style and defining a sort key.
1. Even Distribution
- In this distribution style, rows of a table are spread across all slices in a round-robin fashion, regardless of the values in any particular column, to achieve load balancing
- This approach is good if the table won’t be joined.
- But if this is a large table, like a fact table, that needs to be joined with a small dimension table, this approach can lead to performance issues
- This is because data would need to be shuffled across the slices to perform the join
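Even distribution can be requested explicitly with `DISTSTYLE EVEN`; the table and column names below are illustrative:

```sql
-- Rows are spread round-robin across all slices, with no column affinity.
CREATE TABLE page_views (
    view_id   BIGINT,
    page_url  VARCHAR(2048),
    viewed_at TIMESTAMP
)
DISTSTYLE EVEN;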
2. All Distribution
- This approach is used to overcome the above problem
- Dimension tables, being small, are typically replicated to every node
- Because each node already holds a full copy of the table, joins no longer require data to be broadcast across slices at query time, so join operations become less costly
- This approach comes with a storage tradeoff, since the same dimension table is stored multiple times. But if the table is small, this tradeoff is fine
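A small dimension table can be replicated with `DISTSTYLE ALL`; again, the names here are only for illustration:

```sql
-- A full copy of the table is stored on every node, so joins
-- against it never need to move dimension rows between slices.
CREATE TABLE dim_country (
    country_id   INTEGER,
    country_name VARCHAR(100)
)
DISTSTYLE ALL;
```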
3. Auto Distribution
- Here, Redshift decides on the distribution strategy
- Tables that are small enough are distributed with the ALL style, and large tables are distributed evenly
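Auto distribution can be stated explicitly with `DISTSTYLE AUTO` (it is also the default when no style is specified); this is a sketch with an illustrative table:

```sql
-- Redshift picks the style itself: ALL while the table is small,
-- switching to a spread-out style as it grows.
CREATE TABLE staging_events (
    event_id BIGINT,
    payload  VARCHAR(65535)
)
DISTSTYLE AUTO;
```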
4. Key Distribution
- Rows with the same value in the distribution key column are placed on the same slice
- This can lead to skewed distribution of data across slices
- But this can be useful if the dimension table is not small enough for the ‘All’ style. In that case, both the fact and dimension tables are distributed on the same key, i.e. the joining key
- Redshift then collocates matching rows from both tables on the same slices
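Key distribution on a shared joining key can be sketched like this, with illustrative table and column names:

```sql
-- Distributing both tables on the joining key collocates matching
-- rows on the same slice, so this join needs no data shuffling.
CREATE TABLE dim_customer (
    customer_id INTEGER,
    name        VARCHAR(100)
)
DISTKEY (customer_id);

CREATE TABLE fact_orders (
    order_id    BIGINT,
    customer_id INTEGER,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id);
```

A query joining `fact_orders` to `dim_customer` on `customer_id` can then be resolved slice-locally.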
Sort Keys
- One can define one or more columns of a Redshift table as a sort key
- Upon loading, rows are stored on disk in sorted order within each slice
- This minimizes query time, because rows are laid out contiguously and range-restricted scans can skip blocks that fall outside the filter
- This is helpful for frequent queries that filter or sort on a column, such as a date column shared by the date dimension and its fact tables
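A sort key is declared alongside the distribution style; the fact table below is illustrative:

```sql
-- A sort key on the date column keeps rows ordered on disk,
-- so range filters on sale_date scan far fewer blocks.
CREATE TABLE fact_sales (
    sale_id   BIGINT,
    sale_date DATE,
    amount    DECIMAL(12,2)
)
DISTSTYLE EVEN
SORTKEY (sale_date);
```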
When to use Amazon Redshift
- If you want to move your data warehouse (DW) to the cloud, Amazon Redshift is a very good option
- Amazon Redshift is a columnar database. As such, it is particularly good at performing aggregations over columns. This is especially true when you are querying large amounts of data for insights, such as when performing historical analysis or creating metrics from your recent application data.
- When you want to build an ETL pipeline within the database, Redshift is a very good option
- Amazon Redshift is an MPP database. MPP stands for Massively Parallel Processing. Efficient implementation of columnar storage algorithms and data partitioning techniques give Amazon Redshift an edge in terms of performance.
- AWS allows you to scale Redshift up or out, making the solution scalable for data engineering teams
- If your data engineering team has SQL/PostgreSQL expertise, Redshift is a natural fit, as it is based on PostgreSQL
Some Cons of Redshift
- It requires a good understanding of sort keys and distribution keys, which determine how data is stored and indexed across all Redshift nodes. You need a solid grasp of these concepts and must set them properly on your tables for optimal performance. A table can have only one distribution key, and it cannot easily be changed later on, so you have to think carefully and anticipate future workloads before choosing one.
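To check whether a chosen distribution key is causing skew, the SVV_TABLE_INFO system view exposes a `skew_rows` column (the ratio between the most- and least-populated slices); this is a sketch of such a diagnostic query:

```sql
-- Tables with a high skew_rows ratio have uneven slice distribution
-- and may need a different distribution key.
SELECT "table", diststyle, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC;
```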
- While Redshift is very fast when running queries on huge amounts of data or running reporting and analytics, it is not fast enough for live web apps. So you will have to pull data into a caching layer or a vanilla Postgres instance to serve Redshift data to web apps.