In these times of digitalisation is it particularly important to be able to draw on reliable databases in order to eliminate errors at the source and facilitate a focused and precise way of working. The staging area is a solution for this type of challenge in today’s world.
Designers and architects often underestimate the need for a staging area for the database environment as they consider it a waste of space, effort and development time. Developing staging certainly requires space and effort, but this pays off over the whole life cycle of the database.
The staging area is defined by the classic design of a database which includes an intermediate area (staging area) consisting of 1 to 1 copies of tables from the source system, as shown in the diagram below. Data is temporarily stored in the staging area where it is cleaned and transformed. After these processes are complete, the processed data is transferred to the relevant target database in order to be used accordingly.
This article will explain the advantages of integrating a staging area as well as the arguments why staging should also be integrated into the architecture of a data warehouse.
1) Facilitating the analysis of error sources
Errors in the presentation area are often first brought to our attention through indications from commercial users. Error sources here can have varying natures. They can either be found in the ETL process (extract, transform, load) or it can involve flawed data in the source system, misunderstandings or false requests, for example.
To find this out, an analyst or a responsible person should compare figures from the presentation area with the source systems. However, this can be an enormous task as it could mean deriving just one number in the presentation area from many tables for various source systems. Whether, for example, the extent of revenue was correctly calculated cannot be confirmed if data from offline businesses is found in the oracle database and online sales are found in different MySQL databases. To correlate this, an Excel table must be created that extracts and combines the data from source systems and compares it with the presentation area. This is an elaborate process and prone to error.
A staging area avoids using this procedure meaning a single SQL query is enough to combine all source tables as they are all found in a single database. They can be automatically compared with the presentation area and any shortages are rectified. The probability of errors occurring for simple 1 to 1 transformations of source systems in the staging area is just 0.1%. On the other hand, the risk of making errors in further transformations lies at 99.9%. An analyst can determine relatively easily whether an ETL process has been correctly completed and can point out the corresponding flawed lines.
2) Staging area test phase and testing in exceptional cases
During the test phase it is important to have a stable dataset in order to test different versions of ETL. The staging area offers a clear advantage here of locking the dataset and conducting additional ETL processes and tests. Complex test schemata, such as the load staging, can also be implemented and the saved copies can then be used for ETL in order to quickly simulate the total weekly processing.
This process is not possible for data in source systems as these are also used for other purposes.
Exceptions cases are also often tested such as, for example, ETL processing situations that do not currently exist in the source data. For example, this includes sales that are linked to a non-existing customer. Can the ETL process this situation as planned? If the database has a staging area, these situations can be moulded in the staging area and tested through the ETL.
3) Staging area performance aspects
People who oppose the system claim that additional processing is needed when filling the staging area and this ultimately has a negative impact on the ETL performance. Even if certain resources are required for filling staging tables, this can be advantageous for the performance later on. This often happens when a source table is involved more than once during the filling of the core area. Thus copying in the staging area can be noticeably advantageous for performance. And this is not only true for the database, even the workload that a database gives to the source system must be optimised. An architecture with a staging area enables simpler source system queries to be completed in which each table is only used once as to keep the workload for the source system to a minimum.
ETL tools can also use more effective techniques to carry out the process when all sources are found in the same database. An example of this would be Informatica which uses Pushdown optimisation or ODI (Oracle Data Integrator) which requires no LKM (Load Knowledge Module) for further processing.
The conversion of tables in this same database is required so that the ETL tool can create a link. If these requirements are not met, ETL tools have no other choice than to copy these in the same entity. By introducing a staging area, an architect makes the same but in a manner which can be managed more easily.
As the staging area is managed by a database team, the team can record database statistics that are necessary for improving the ETL and can refer back to them at any point. There is no guarantee that the source system will contain all necessary statistics for the database SQLs. This is why queries that access the source system might not function at an optimum level.
Another advantage of a staging area is the creation of an index which then becomes important as soon as a staging table is used more than once for transformations. This is an easily manageable task and requires no support from a database administrator whose main job involves supporting OLTP activities (real-time transaction processing).
When a staging area is used in the data warehouse, the distributed environment consist of very simple 1 to 1 copy tasks from source systems and all related processing is carried out in a single database. Most performance problems arise in a single database environment. In order to avoid this, an optimisation of performance is carried out after a while.
However, if the architecture does not include a staging area, there are many complex tasks that access different environments. This often involves a significantly more complex task when optimising a distributed system and requires people who have experience with different technologies. Finally, there is the possibility that the database team does not have sufficient authorisation to solve performance problems on the source system page and, potentially, cannot consult the design plan, table structure, hardware configuration etc.
4) Procedures, sequences and the implementation of requirements by an analyst
After all data from the various source systems has been integrated into the database, the commercial users lay down requirements for the analyst who then deals with these requirements with their analysis. In order to be able to meet user requirements, the following conditions should be met:
- Access to all source systems
- Arrangement of a window for performing queries
- Experience with all current source technologies
- Extensive technical skills in order to combine source data
A single staging environment facilitates work and time spent of processing requirements immensely, particularly when staging tables also contain columns that are not yet used in the database.
The staging area is particularly attentive in order to tackle this type of avoidable problems. It helps to create and support your project more flexibly and even leads to improved performance.