The Lakehouse Approach – Cloud Data Platform on AWS
Ingo Klose
Management Consultant
Ingo is a consultant specializing in data warehousing, data engineering, data architecture and cloud-based data platforms, and advises clients from a wide range of industries such as retail, internet, and media & telecommunications. His project experience ranges from the conception and implementation of data-based solutions to their operation. In addition to his subject-matter, methodological and technical expertise, his strengths include problem-solving, strategic and customer-oriented thinking, and moderation.
In our free series of online events under the banner of Data Firework Days, we introduced you to the b.telligent reference architecture for cloud data platforms. In this blog series, we'd now like to take a closer look at the cloud and the individual providers of cloud services. The first post of this three-part series, Blueprint: Cloud Data Platform Architecture, examined the architecture of cloud data platforms in general, including two building blocks worth recalling here:
Central services – supplementary services for building and running a data platform and ensuring data governance
Data services – on-demand access to data
Spotlight on AWS
In this blog post we look at how Amazon's Lake House approach fits in with our reference architecture and how it can be used to implement a cloud data platform.
So What Exactly Is the Lake House Approach and What Services Does It Offer?
The AWS Lake House architecture approach describes how various services are integrated in the AWS cloud. This architecture makes it possible to create an integrated and scalable solution for processing, storage, analysis and governance of large amounts of data. Unlike other providers' lakehouse concepts, this architecture goes beyond the mere integration of data lake and data warehouse technologies. It also includes streaming, data governance and analytics.
For AWS, demand-oriented data services are as important as the cost-effective scalability of the overall system. AWS maintains that a product based on a single service can never be more than a compromise; the provider is therefore a keen advocate of combining a variety of services to meet users' requirements as optimally and comprehensively as possible.
The above diagram shows the individual AWS services available as part of the Lake House architecture, with the data lake at its core. The following services are used to build a data lake on AWS (a short sketch of how they interact follows the list):
Amazon S3 – data storage
AWS Glue – data catalog & ETL
AWS Lake Formation – data governance
Amazon Athena – direct access to the data lake via an SQL interface
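To make this interplay concrete, here is a minimal Python (boto3) sketch that queries data stored in S3 through Athena, using a database registered in the Glue Data Catalog. All resource names below (the sales_lake database, the orders table, the results bucket) are hypothetical placeholders, not part of any real deployment.

```python
import time

import boto3

# Minimal sketch: query S3 data registered in the Glue Data Catalog via Athena.
# Region, database, table and bucket names are placeholders.
athena = boto3.client("athena", region_name="eu-central-1")

response = athena.start_query_execution(
    QueryString="SELECT customer_id, order_total FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "sales_lake"},  # Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes (simplified; production code should
# use exponential backoff and a timeout).
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print("Query finished with state:", state)
```

Because Athena reads the table definition from the Glue Data Catalog, the same metadata also drives Glue ETL jobs and, as described below, Redshift Spectrum.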
The other services are then arranged around this central core. The advantage here is that, depending on the application, they can be used individually or in combination – and the cost is usage-based.
In this integrated system, communication between the services is also possible in different directions. A service such as a relational database can not only supply data to the data lake but can also receive refined data back from the data lake or data warehouse. The AWS Glue Data Catalog ensures that data is available to all services via a common metadata view, and the data in the Data Catalog is subject to governance via AWS Lake Formation.
The concept is easy to grasp by looking at the interoperability between the data lake and Amazon Redshift, the AWS cloud data warehouse service. The metadata in the Data Catalog allows a specific part of the data lake to be accessed transparently as part of the data warehouse, using the Spectrum feature within Amazon Redshift. With Spectrum, a database defined in the Glue Data Catalog is represented as a schema integrated within the Amazon Redshift database. Tables in this schema can then be combined with native Redshift tables in SQL queries, joins included. This access remains subject to governance by AWS Lake Formation, so users can only reach the tables, columns and data for which they have access rights.
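As a hedged illustration of this interoperability, the following Python (boto3) sketch uses the Redshift Data API to register a Glue database as an external schema via Spectrum and then join it with a native Redshift table. The cluster identifier, IAM role and all object names are assumptions made for the example.

```python
import boto3

# Minimal sketch: expose a Glue database as an external schema in Redshift
# (Spectrum) and join it with a native table. Cluster, role and object
# names are placeholders.
rsd = boto3.client("redshift-data", region_name="eu-central-1")

ddl = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
FROM DATA CATALOG
DATABASE 'sales_lake'  -- database in the Glue Data Catalog
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role';
"""

query = """
SELECT c.customer_name, SUM(o.order_total) AS revenue
FROM analytics.customers AS c  -- native Redshift table
JOIN lake.orders AS o          -- S3 data, read via Spectrum
  ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
"""

for sql in (ddl, query):
    rsd.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="analytics",
        DbUser="etl_user",
        Sql=sql,
    )
```

A nice property of this setup is that the external schema only has to be defined once: every table subsequently added to the Glue database becomes visible in Redshift automatically.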
Data Lake, Data Warehouse, Data Catalog, Etc. – What Parts of the Cloud Data Platform Are Covered by the Lake House Approach?
The services and functionalities described in the Lake House approach can now be mapped onto most of the b.telligent reference architecture for cloud data platforms:
Ingestion and processing
Streaming – Amazon Kinesis
Batch – AWS Glue ETL
Data lake – data storage
Amazon S3
Amazon Athena
Data warehouse
Amazon Redshift
Analytical platform
Amazon SageMaker
Metadata management and GDPR services
AWS Glue Data Catalog
AWS Lake Formation (see the governance sketch below)
Parts of the reference architecture covered by the AWS Lake House architecture
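To illustrate the governance layer just listed, here is a minimal Python (boto3) sketch that uses Lake Formation to grant a role column-level SELECT access on a catalog table. The role ARN, database, table and column names are hypothetical.

```python
import boto3

# Minimal sketch: column-level governance with Lake Formation. Grants a
# role SELECT on two columns only; all names and ARNs are placeholders.
lf = boto3.client("lakeformation", region_name="eu-central-1")

lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst-role"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_lake",
            "Name": "orders",
            # Only these columns become visible; PII columns stay hidden.
            "ColumnNames": ["order_id", "order_total"],
        }
    },
    Permissions=["SELECT"],
)
```

Because Athena, Glue and Redshift Spectrum all resolve tables through the Data Catalog, this single grant takes effect consistently across those services.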
The remaining parts of the b.telligent reference architecture are covered by other services. These are not explicitly mentioned in the Lake House architecture approach but are part of the extensive AWS service catalog:
Data visualization/reporting – Amazon QuickSight
Automation & scheduling – Amazon Managed Workflows for Apache Airflow (MWAA) (see the DAG sketch after this list)
Continuous integration and deployment – AWS CodePipeline and AWS CodeDeploy
Process & cost monitoring & logging – Amazon CloudWatch, AWS CloudTrail and AWS Cost Explorer
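As a small illustration of the scheduling building block, the following sketch shows what a daily pipeline in MWAA could look like: a standard Airflow DAG that triggers an existing Glue ETL job through the Amazon provider package. The DAG id and Glue job name are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# Minimal sketch of a daily MWAA/Airflow DAG that triggers a Glue ETL job.
# The DAG id and Glue job name are placeholders.
with DAG(
    dag_id="daily_lake_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_orders = GlueJobOperator(
        task_id="load_orders",
        job_name="orders-etl",     # existing Glue job to run
        wait_for_completion=True,  # fail the task if the Glue job fails
    )
```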
A complete implementation of the reference architecture with AWS services could look like this:
If you're about to build a cloud data platform and would like to know more about implementing it on AWS, just contact us and we'll be happy to explain how we can guide you on your journey into the world of the AWS cloud.