The dominant factor in the performance of Amazon Redshift, and of cloud DWHs in general, is the speed of data movement:
- How rapidly do Redshift nodes receive data from Redshift's managed S3 storage?
- How rapidly are data from individual tables exchanged between cluster nodes?
So, regardless of whether you work with a cluster or Redshift Serverless, reducing the I/O load is always key! I/O requirements play a major role in database sizing.
The size of a DWH thus depends on the following factors:
- Volume of compressed data requiring processing, which usually depends less on total size than on the query patterns against the largest fact tables: if typically only 5% of these facts are queried, then sorting by a time attribute, for example, helps reduce I/O significantly.
- Requirements concerning the duration of daily or hourly processing. The following standard requirements are meaningful here:
- A daily process should take no longer than 2-4 hours, so that processes that abort can be restarted in the morning after troubleshooting, preventing the loss of an entire day.
- For hourly processing to make sense at all, the involved processes should run for just a few minutes each. The requirement here is to process the freshest possible data and keep reporting up-to-date.
- Processes needed at intervals of minutes/seconds, i.e. streaming processes, should also be able to run correspondingly fast to prevent data backlogs.
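The I/O reduction from time-based sorting mentioned above can be sketched in SQL (table and column names here are hypothetical):

```sql
-- Hypothetical fact table sorted by event date. Redshift keeps
-- min/max values per block (zone maps), so a filter on the sort key
-- lets the engine skip blocks outside the requested range.
CREATE TABLE fact_sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12,2),
    sale_date   DATE
)
SORTKEY (sale_date);

-- A query touching only the last month reads roughly the blocks
-- containing that month, e.g. ~5% of a multi-year history,
-- instead of scanning the full table.
SELECT customer_id, SUM(amount) AS total
FROM fact_sales
WHERE sale_date >= DATEADD(month, -1, CURRENT_DATE)
GROUP BY customer_id;
```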
This necessitates an understanding of data use cases (requirements), query patterns and changes in data volume over time in order to enable proactive adjustments where necessary.
- Factors influencing query performance for end-users / BI tools
- Complexity of SQL queries and user expectations for reporting and ad-hoc analytics: >95% of all queries should complete within seconds, while the remaining 5% (provided they do not run constantly) may take minutes.
- Number of queries & concurrent users
- Long-running queries (e.g. month-end closing statement) which are executed very rarely and should not affect sizing.
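One way to verify the ">95% within seconds" expectation is to inspect query runtimes in Redshift's system tables. A sketch against a provisioned cluster follows; the 10-second threshold is illustrative, and on Serverless the SYS_QUERY_HISTORY view provides similar data:

```sql
-- Share of queries finishing within 10 seconds over the last 7 days.
SELECT
    COUNT(*) AS total_queries,
    SUM(CASE WHEN DATEDIFF(second, starttime, endtime) <= 10
             THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS share_under_10s
FROM stl_query
WHERE starttime >= DATEADD(day, -7, CURRENT_DATE);
```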
Ultimately, the sizing of a DWH is complete precisely when all these requirements have been fulfilled. Data storage itself is a secondary aspect: the separation of compute and storage (at least with Serverless and the newer RA3 nodes) makes storage capacity more or less "unlimited", and storage has only a minor impact on the total cost of a Redshift deployment. Roughly 20 hours of ra3.xlplus, or 8 hours at 8 RPU of Redshift Serverless, cost about as much as one TB-month of storage. Compute usage, not data volume, is generally the cost driver.
The ability to scale a database seems to solve these issues easily. However, we do not recommend relying on scaling alone: experience shows that the development patterns described next usually have a much greater influence on query performance and are far more cost-effective to implement.
These development patterns are easy to adopt and should always be part of data engineering. Delta processes and sort keys are usually the biggest levers, as they readily allow performance gains by factors of 100-1000 through drastically lower data processing requirements: instead of 10 years of history, only the changes of the last month or the last few days are processed, as needed.
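A delta process of this kind can be sketched as follows (table names and the change-tracking column are assumptions; the delete-then-insert idiom is the classic Redshift upsert pattern, and newer Redshift versions also support MERGE):

```sql
-- Instead of rebuilding years of history, upsert only the rows that
-- arrived in the staging table since the last run, inside one
-- transaction so readers never see a partial state.
BEGIN;

-- Remove old versions of the changed rows from the target.
DELETE FROM fact_sales
USING staging_sales s
WHERE fact_sales.sale_id = s.sale_id;

-- Insert the fresh delta, e.g. only yesterday's changes.
INSERT INTO fact_sales
SELECT *
FROM staging_sales
WHERE updated_at >= DATEADD(day, -1, CURRENT_DATE);

COMMIT;
```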
The other best practices have less of an impact:
- Import from / export to S3 using compressed or column-based data formats (e.g. Parquet).
- Reduction in data communications between cluster nodes through the right choice of distribution keys.
- Reduction in data through a wise choice of data model (e.g. aggregate tables) or materialized views.
- Offloading data to the data lake and use of Redshift Spectrum, plus caching in BI tools or these tools' own data storage, to reduce query volumes.
- Processing of mass data outside Redshift, e.g. via AWS Glue, in order to reduce peak usage accordingly.
- Materialization of intermediate results in temporary tables to allow accurate performance analysis and "facilitate" Redshift query optimization.
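A few of these best practices in sketch form (bucket names, IAM role, tables and columns are hypothetical):

```sql
-- Import compressed, column-based data from S3 (Parquet).
COPY fact_sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
FORMAT AS PARQUET;

-- Distribution key chosen to co-locate rows joined on customer_id
-- on the same node, reducing traffic between cluster nodes.
CREATE TABLE fact_orders (
    order_id    BIGINT,
    customer_id BIGINT DISTKEY,
    amount      DECIMAL(12,2)
);

-- Pre-aggregate as a materialized view to shrink the data that
-- recurring BI queries need to scan.
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT sale_date, SUM(amount) AS total
FROM fact_sales
GROUP BY sale_date;

-- Materialize an intermediate result in a temporary table so its
-- row count is known, which helps both performance analysis and
-- the optimizer's planning of subsequent joins.
CREATE TEMP TABLE tmp_active_customers AS
SELECT DISTINCT customer_id
FROM fact_sales
WHERE sale_date >= DATEADD(month, -1, CURRENT_DATE);
```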
In our experience, the choice between Data Vault, dimensional and 3NF modelling plays a secondary role, so the measures described above can be applied regardless of the modelling approach.