Skip to main content

Data-Warehouse-Automation (Part 2: Development)

A lot of what have hitherto been manual programming tasks can be replaced - or at least greatly simplified - by DWA tools. The precise elements of development that can be automated in that respect can vary greatly from tool to tool. For example there are pure code generator concepts, with the aid of which database structures and ETL/ELT processes can be generated automatically ("design time"). On the other hand, extensive integration suites exist that can generate but also manage the entire DWH lifecycle, from provision of the data in the sources right through to the data marts ("run time").

During the development phase there is a series of tasks in which a DWA tool can provide support. The following deals in particular with the fields of reverse engineering and compatibility, analysis, implementation and framework condition.

Reverse engineering and compatibility

DWA can be used in various project scenarios. These possible scenarios range from the extension to the modernization of a DWH, right through to a (partial) regeneration.

moegliche-dwh-projektszenarien

                                                                                    Possible DWH project scenarios

A DWH solution already exists in most cases. Accordingly, the extent to which compatibility with the old DWH exists can be decisive when selecting a DWA tool. Where some tools use the existing DWH merely as a source, others can connect to the already developed data model and develop it further.

The concept of certain DWA tools is that they are based exclusively on an already prepared data basis ("pre"-stage). This pursues a systematic separation between data provision and data processing. The stage thereby serves as an interface at which all the necessary preparations have already been made. There might be a need here for additional external tools for data manipulation and for loading the stage tables.

Other DWA tools provide for a holistic approach. These include specific connectors with which it is possible to directly access source systems, as well as the possibility for later data manipulations. This increases the chance of being able to depict the entire data management via one DWA tool alone.

When data is provided from the source systems, DWA tools can be subject to certain restrictions such as:

  • Only standard drivers such as ODBC are available
  • No delta determination possible
  • No real-time processing possible
  • No connection to cloud platforms or Big Data applications possible

With the aid of an upstream inventory of existing and potential source systems, DWA tools should be checked in advance for possible compatibility problems and other limitations.

Analysis

The nature of the delivering source systems must be analysed so that database structures and data management processes can be created automatically. In most cases, DWA tools enable the reverse engineering of database structures, during which the basic information can be collected. Here too it is also necessary to again check compatibility with different databases. The basis for this is the usual DBMS meta information such as column types, integrity conditions, relationships and also extended object statistics. However, source systems that have a complete and all-encompassing database exist only seldom in operational practice. Platform-dependent optimizations and user-driven modifications can make the automated analysis of source systems more difficult. However, the quality of the generated DWH/ETL code is greatly dependent on the level of detail in the meta information available to the DWA tool. Here the DWA tool must provide the developer with the most practical tools possible in order to add missing information to the meta data. Most of the perceived usability depends on this scope of activity, because this is one of the most important determining factors with which the developer can influence the tool's behaviour. Consequently, not taking account of the importance of individual objects and their relationships can result in entire parts of the system being restructured later.

In addition to the structural analysis, data inventories can be analysed in advance by profiling functionalities. Here, the DWA tool support extends from purely automated placement of various queries and the subsequent output of the results right through to design suggestions. Consequently, it may be appropriate to transfer from the sources objects with large volumes of data and a high rate of change in the delta process and (near) real-time, and others as a full extract on a daily basis.

A data lineage should still be possible on the DWH side, since all the meta information required for this is already available to the tool. The crucial difference is made here by the form of the lineage visualization and the possibility of developer interaction. A complete lineage is made more difficult by particular levels of developer freedom. For example, if dynamic SQL or external code is processed at a certain point in the ETL process, it is almost impossible for the DWA tool (and in some cases also for humans) to fully comprehend the process flow. At this point it is the task of the DWA tool and the developer to find a healthy middle ground in terms of levels of freedom and standardization. 

Implementation

To date, DWA tools have not yet been able to completely abstract the actual DWH, with the result that it seems like a black box to the developer / professional user. Depending on the level of automation, the actual developer activity inside the various DWA tools can vary greatly in terms of scope and complexity. The broader the choice of uses for the DWA tool, the greater the extent to which non-automated or partially automated interventions by the developer are possible:

dwh-automatisierungstool-metadaten

                                                      Manual coding with data warehouse automation tools

As the figure shows, manual interventions can be required across the entire DWH layer architecture. So-called "hard rules" must be defined for the transition from the source systems into the abstraction layer. Source-specific data types must possibly be converted and harmonised, or a structure unsuitable for the DBMS must be pre-structured. Specific questions and trade-offs can play a role here, which make a human decision essential. In an ideal case, hard rules only have to be defined once for each source system and the DWA tool offers already pre-defined solutions.

During the transition from the abstraction layer to the integration layer on the other hand, "soft rules" are typically applied. The actual integration work required for a uniform and company-wide database - the core DWH - is done here. This includes, among other things, the harmonization of business keys.[1] If the core DWH is divided, as can happen for example with the RAW Vault and Business Vault and in the case of the data vault models, the soft rules apply instead to the transition from the RAW Vault to the Business Vault. On the other hand, tasks such as the technical  historization of the data can be completed more easily with the aid of the DWA tool.

For the presentation layer, the data is prepared so that it meets the requirements of the specialist department in terms of form and layout. Intervention is also necessary here if tasks arise that exceed the simple technical generation of a star schematic or a simple view. Some tools go one step further and can pre-generate data structures for common reporting tools.

In the case of manual interventions, their synchronization with the DWA tool is decisive. Interventions must not only be made possible or simplified, they must also be embedded in the way in which the DWA tool works. Otherwise these could be overwritten in the event of a new code generation, or could even cause far-reaching complications such as inconsistent data. Few interventions are necessary if the tool is limited merely to the generation of the code for creating database objects and the data management processes of a particular part of the system (for example the staging). However, many of the necessary steps must be created with separate software such as a database development tool.

Framework conditions

In the project business, it can be the case that not only several developers from one company but also additional teams of developers from other companies are working on the same database structures. In that eventuality, the metadata of the DWA tool are affected just as much as the database structures of the DWH-DB, thereby giving rise to several requirements. It must be ensured that changes made by one developer do not thwart the changes made by another developer. To achieve this, some DWA tools utilize the functions of version control systems (VCS). If the DWA tool uses OS-based configuration files, the typical VCS functions like the checking-in and checking-out of objects and also MERGE can be utilized in order to simplify the cross-company development. Other DWA tools save their meta information exclusively in database schemes. This allows multiple editing of the same object to be regulated by the GUI of the DWA tool and also by the transaction security of the database. In addition, rolling back to an earlier consistent status is possible via DB backups and flashbacks, although this works only to a limited extent at single record level.

Summary

At the beginning of a DWA initiative, it should be clearly established which sub-areas of the development are to be covered by the DWA tool. If for example extended profiling functions are required beyond purely (partial) generation of code, an extensive integration suite is more appropriate than different tools with, in some cases, overlapping scopes of functions. The more over-arching the configuration of a DWA tool, the greater the extent to which the developer must also be able to interact with the tool. The main challenge for DWA tools here is to offer an intuitive and plausible feel despite covering an increasingly broad range of functions, so that the advantages of automation can generally be utilized comprehensively and consistently. There is still considerable scope, particularly in the way in which the different DWA tools support the developer in the definition of soft rules, and this could in future become an important defining characteristic.

It would be wrong to assume that using DWA tools means that less specialist know-how will be necessary for the development of a DWH. As described above, many decisions must still be made during the DWH development, with many problems to be solved. In contrast, there is more of a refocusing of tasks away from detailed technical questions such as the historization of database structures and towards conceptual matters. Consequently, development activities with a DWA tool are increasingly defined more by the introduction of new standards and constantly questioning existing ones in order to develop the best solutions for an individual DWH system. However, practical experience has also shown that DWA tools can also strictly demand certain standards, and this can limit the scope for action. For example, a certain layer architecture, modeling variant, naming convention or loading strategy can be defined. This often leads to an alignment of processes and developer behaviour on the DWA tool, for which allowance should be made particularly by companies with existing DWH solutions or strongly defined standardization.


[1] If the Data Vault is used for modelling the Core DWH, this step is typically completed during the transition from the RAW Vault into the Business Vault.