(Near-) real-time processing with Azure Synapse Analytics
As an alternative to batch processing, data in this use case are forwarded to Synapse via Azure Event Hubs. Processing is done with Spark Streaming on Azure Synapse Spark pools and is divided into several stages. A central service here is Azure Data Lake Storage Gen2 (ADLS Gen2), which implements the write-once, access-often analytics pattern in Azure. The storage format employed is Delta, which offers higher reliability and performance for all data sources stored in ADLS and is therefore well suited to IoT data processing.
In ADLS, data are divided into different layers:
- Raw: Raw data are stored in Delta format and are neither transformed nor enriched.
- Standardized: Data are stored in a standardized format with a clear structure.
- Curated: Data are enriched with additional information.
- Export: Data are prepared for export and further processing.
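The layer layout above maps naturally onto folder prefixes in a single ADLS Gen2 container. A minimal sketch of such a convention follows; the storage account, container, and dataset names are assumptions for illustration, not part of the original setup:

```python
# Hypothetical ADLS Gen2 layout for the four layers described above.
ACCOUNT = "iotlake"        # assumed storage account name
CONTAINER = "telemetry"    # assumed container name
LAYERS = ("raw", "standardized", "curated", "export")

def layer_path(layer: str, dataset: str) -> str:
    """Build the abfss:// URI for a dataset in a given layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return (f"abfss://{CONTAINER}@{ACCOUNT}.dfs.core.windows.net/"
            f"{layer}/{dataset}")
```

A Spark job promoting data from one layer to the next would then read from `layer_path("raw", ...)` and write Delta output to `layer_path("standardized", ...)`, keeping the layer boundaries explicit in the storage paths.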
The final export layer is required because of a current limitation of Azure Synapse dedicated SQL pools: their external tables cannot read the Delta format (see the Azure documentation).
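Because of that limitation, the export layer can hold a Parquet copy of the curated Delta data, which dedicated SQL pool external tables can read. A sketch of this step follows; the path convention and helper name are assumptions, and the Spark calls are shown as comments since they require a Synapse Spark pool:

```python
def export_path(curated_path: str) -> str:
    """Map a curated-layer path to its export-layer counterpart
    (assumes the layer name appears as a path segment)."""
    return curated_path.replace("/curated/", "/export/")

# On a Synapse Spark pool (sketch):
# curated = "abfss://telemetry@iotlake.dfs.core.windows.net/curated/sensors"
# df = spark.read.format("delta").load(curated)
# df.write.mode("overwrite").parquet(export_path(curated))
```

The dedicated SQL pool can then define an external table over the Parquet files in the export layer.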
Connections between Event Hubs and ADLS Gen2 are established using Spark Streaming. A prerequisite for this is that an access token (secret) is stored in Azure Key Vault and retrieved at runtime with mssparkutils.
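The wiring of secret retrieval and stream setup can be sketched as follows. The Key Vault and secret names are assumptions; the `eventhubs.connectionString` option belongs to the Azure Event Hubs Spark connector, and the cluster-only calls (`mssparkutils`, `spark`) are shown as comments:

```python
def eventhub_stream_options(connection_string: str) -> dict:
    """Build the option map for the Event Hubs Spark connector.
    On a real cluster the connection string is typically encrypted
    first via EventHubsUtils.encrypt before being passed here."""
    return {"eventhubs.connectionString": connection_string}

# On a Synapse Spark pool (sketch; vault and secret names are assumed):
# conn = mssparkutils.credentials.getSecret("my-key-vault", "eventhub-conn")
# stream = (spark.readStream
#             .format("eventhubs")
#             .options(**eventhub_stream_options(conn))
#             .load())
# (stream.writeStream
#    .format("delta")
#    .option("checkpointLocation", raw_path + "/_checkpoints")
#    .start(raw_path))
```

Keeping the secret in Key Vault and resolving it via mssparkutils avoids embedding the Event Hubs connection string in notebook code.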