The diagram gives an overview of our Data Lake framework. In the first step (blue box), a data sample for a use case is loaded into the S3 config bucket. This bucket contains the files that provide the basis for the configuration information. A Glue crawler is started as soon as a new file arrives in the bucket, populating our Config Data Catalog with the schemas, tables, and columns for the Lakehouse. For each file that is added, one table is created in the Config Data Catalog. The next step is the only manual one: the columns in the Config Data Catalog are the central place to define metadata such as primary keys, PII data, or business keys (check out the screenshot below).
We will benefit from this metadata once we get to the ETL process.
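To make this concrete, the ETL steps can read that column-level metadata back from the Data Catalog with boto3. The following is a minimal sketch, not the framework's actual code; it assumes the flags were entered as column parameters under hypothetical keys such as `pii` and `primary_key`:

```python
import boto3

glue = boto3.client("glue")

def load_column_metadata(database: str, table: str) -> dict:
    """Read column-level metadata from the Config Data Catalog.

    Assumes flags like PII or primary key were stored as column
    parameters (hypothetical keys "pii" / "primary_key").
    """
    response = glue.get_table(DatabaseName=database, Name=table)
    columns = response["Table"]["StorageDescriptor"]["Columns"]
    return {
        col["Name"]: {
            "type": col["Type"],
            "is_pii": col.get("Parameters", {}).get("pii") == "true",
            "is_primary_key": col.get("Parameters", {}).get("primary_key") == "true",
        }
        for col in columns
    }

# Example (hypothetical database/table names):
# metadata = load_column_metadata("config_catalog", "customers")
```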
Now, as data arrives in the landing zone bucket, a trigger starts the framework’s data pipeline to process the file through the Data Lake layers. In our first implementation, only batch processing of CSV files is supported, but the framework can be extended to other file formats and ingestion types.
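As described in the next paragraph, the framework implements this trigger as an AWS Lambda function reacting to S3 events. A minimal sketch of such a handler, with hypothetical job and argument names, might look like this:

```python
import urllib.parse

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """S3-triggered Lambda that kicks off the first Glue job of the pipeline."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # S3 event keys are URL-encoded, so decode them before use.
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    glue.start_job_run(
        JobName="landing-to-raw",  # hypothetical Glue job name
        Arguments={
            "--source_bucket": bucket,
            "--source_key": key,
        },
    )
```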
As the first step of the data pipeline, a Glue ETL job is started to transfer the files to the Raw bucket. We implemented the automatic trigger as an AWS Lambda function, but another good, more recent option on AWS for starting the pipeline is Glue Workflows. Here, the metadata from the Config Data Catalog, such as column names and which columns contain PII data, supplies the file-specific variables in the ETL script. As part of this processing step, the data is enriched, e.g. with a loading timestamp, and converted to Parquet, a columnar file format, to store data efficiently in the Data Lake.
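A stripped-down sketch of this Glue ETL step could look as follows; the argument names match the hypothetical Lambda above, and the Raw bucket path is illustrative:

```python
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

# Job arguments passed in by the triggering Lambda (names are illustrative).
args = getResolvedOptions(sys.argv, ["source_bucket", "source_key"])

spark = SparkSession.builder.getOrCreate()

# Read the CSV from the landing zone and enrich it with a loading timestamp.
df = (
    spark.read
    .option("header", "true")
    .csv(f"s3://{args['source_bucket']}/{args['source_key']}")
    .withColumn("load_timestamp", current_timestamp())
)

# Write the enriched data to the Raw bucket as Parquet (hypothetical target path).
df.write.mode("append").parquet(
    f"s3://raw-bucket/{args['source_key'].rsplit('.', 1)[0]}/"
)
```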
As data arrives in the Raw bucket, a second Glue job is triggered by a Lambda function. In this step, the metadata from the Config Data Catalog is applied to separate PII from non-PII data. When developing the framework, we chose to handle PII data by splitting it into two standardized buckets, one with PII and one with non-PII data.
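Driven by the metadata loaded earlier (see the `load_column_metadata` sketch above), this split can be expressed as two column projections. Again a sketch under the same assumptions, with hypothetical bucket names:

```python
from pyspark.sql import DataFrame

def split_pii(df: DataFrame, metadata: dict):
    """Project a Raw-layer DataFrame into a PII and a non-PII half,
    driven by the column metadata from the Config Data Catalog."""
    pii_cols = [c for c in df.columns if metadata.get(c, {}).get("is_pii")]
    key_cols = [c for c in df.columns if metadata.get(c, {}).get("is_primary_key")]
    # Keep the primary keys in both halves so they can be re-joined later.
    pii_df = df.select(list(dict.fromkeys(key_cols + pii_cols)))
    non_pii_df = df.select([c for c in df.columns if c not in pii_cols or c in key_cols])
    return pii_df, non_pii_df

# Hypothetical standardized target buckets:
# pii_df, non_pii_df = split_pii(df, metadata)
# pii_df.write.mode("append").parquet("s3://standardized-pii/...")
# non_pii_df.write.mode("append").parquet("s3://standardized-non-pii/...")
```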
As the final step, Glue crawlers for the standardized buckets are started, finalizing the standardized data layer. From here, you can easily create all the data analyses and insights you can imagine. Furthermore, if additional data model layers are needed, they can easily be connected.
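Starting these crawlers is a one-liner per bucket with boto3; the crawler names below are illustrative:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical crawler names, one per standardized bucket.
for crawler in ("standardized-pii-crawler", "standardized-non-pii-crawler"):
    glue.start_crawler(Name=crawler)
```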