Highlights of Spark Summit Europe 2017

I am pleased to once again report live from Spark Summit Europe in this blog post. It is the third European summit so far and is taking place in Dublin this year after Amsterdam in 2015 and Brussels in 2016.

The event

The Summit lasts for three days and kicks off with a day devoted to a series of intensive training sessions, with the official opening and the first keynote speeches taking place on the morning of the second day. As always, this part of the event is opened in person by Matei Zaharia, the original developer of Spark.

This year, around 1200 participants will be able to attend 100 lectures in five separate tracks covering development, streaming, data science, technical deep dives and data engineering. Despite all this, I think it is still a relatively informal event, where you will find the original Spark crew mixing with everyone else during the breaks. During the keynote speech, I noticed that Holden Karau, one of the original developers and author of several Spark books, was sitting just behind me.

But back to what is on offer. Before my first Spark Summit, I imagined it would be very big-data-centric. It turned out, in fact, to be very much a specialist data science event. Almost every practically oriented lecture contains a complex machine learning use case. All the applications draw on big data sources, but the particular benefit of Spark is that its performance and usability let this aspect fade into the background.

Update from Matei

The first keynote speech by Matei Zaharia described some of the new features of the soon-to-be-released Spark 2.3. Databricks and the Spark community currently have two development priorities: streaming and deep learning. Both are expected to simplify Spark for the end user.

Structured streaming

The aim is to provide a "high-level end-to-end API" with a very simple interface. This should make it possible to work with a stream in the same way as a DataFrame, including through Spark SQL and in combination with DataFrames from batch calculations. Structured streaming is rounded off with the promise of "exactly-once processing" – in other words, fault-tolerant stream processing in which each record affects the result exactly once, even after a failure.
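To give an idea of what "a stream like a DataFrame" can look like in practice, here is a minimal PySpark sketch. It was not shown in the keynote. The names SENSOR_NAMES, build_enriched_stream and run_demo are made up for illustration, and the built-in "rate" test source stands in for a real stream such as Kafka.

```python
# Hypothetical lookup data standing in for a table produced by a batch job.
SENSOR_NAMES = [(0, "hall"), (1, "lab"), (2, "roof")]


def build_enriched_stream(spark):
    """Return a streaming DataFrame joined with a static (batch) DataFrame."""
    # Imported here so the module can be loaded without Spark installed.
    from pyspark.sql.functions import col

    # Static DataFrame, as it might come out of a batch calculation.
    names = spark.createDataFrame(SENSOR_NAMES, ["sensor_id", "sensor_name"])

    # The "rate" source continuously emits rows with columns (timestamp, value).
    readings = (
        spark.readStream.format("rate")
        .option("rowsPerSecond", 5)
        .load()
        .withColumn("sensor_id", col("value") % len(SENSOR_NAMES))
    )

    # Same DataFrame operations as in batch code: a stream-static join.
    return readings.join(names, on="sensor_id", how="inner")


def run_demo(seconds=20):
    """Print the joined stream to the console for roughly `seconds` seconds."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("stream-sketch")
        .getOrCreate()
    )
    query = (
        build_enriched_stream(spark)
        .writeStream.format("console")
        .outputMode("append")
        .start()
    )
    try:
        query.awaitTermination(seconds)
    finally:
        query.stop()
        spark.stop()
```

Calling run_demo() on a machine with PySpark installed should print micro-batches of joined rows to the console. The point of the sketch is that the join is written exactly as it would be for two batch DataFrames.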

Deep learning

Spark 2.3 will also provide a new API, Deep Learning Pipelines. This should enable more users to make productive use of deep learning. The API is based on ML Pipelines and mainly supports TensorFlow and Keras.

A Databricks employee demonstrated an example of deep learning by classifying shoes on the basis of images from online shops. The example requires only seven lines of code (!) and was easily explained in a few minutes.

Further improvements

PySpark users will no doubt be delighted to hear that one of the biggest brakes on the performance of PySpark will be removed in Spark 2.3. It will soon be possible to use vectorized pandas functions as UDFs (User Defined Functions). There will also be further performance enhancements for Python and R, and better support for Kubernetes.
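As a rough illustration of the idea, and not code from the keynote: a vectorized UDF receives a whole pandas Series per batch instead of one Python value per row, which removes most of the per-row overhead. The function and column names below (fahrenheit_to_celsius, add_celsius_column, temp_f, temp_c) are assumptions chosen for the example, and the sketch assumes PySpark 2.3 or later with PyArrow installed.

```python
def fahrenheit_to_celsius(temps):
    """Vectorized conversion: `temps` is a whole pandas.Series, not one value."""
    return (temps - 32.0) * 5.0 / 9.0


def add_celsius_column(df, fahrenheit_col="temp_f"):
    """Apply the function above to a Spark DataFrame as a pandas (vectorized) UDF."""
    # Imported here so the plain function above can be used without Spark installed.
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    # Wrap the Series -> Series function as a scalar pandas UDF returning doubles.
    to_celsius = pandas_udf(fahrenheit_to_celsius, returnType=DoubleType())
    return df.withColumn("temp_c", to_celsius(df[fahrenheit_col]))
```

Given a Spark DataFrame df with a numeric column temp_f, add_celsius_column(df) would return the same DataFrame with an additional temp_c column. A classic row-at-a-time UDF would compute the same result, but with one Python call per row.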

Databricks Delta

Databricks CEO Ali Ghodsi surprised the audience with the premiere of Databricks Delta.

Delta is a unified data management system that extends the existing database ecosystem. Built on top of simple mass storage (such as S3 or HDFS with Parquet), it offers a structured abstraction layer that promises ACID compliance, high performance and low latency.

In the words of Ali: "A unified management system for real-time big data, combining the best of data warehouses, data lakes and streaming."

An impressive live demo showed how SQL was used to query a Kafka topic, with the results updating in real time.

Program priorities

Streaming and deep learning are also the main themes for the other presentations during the two days. In streaming in particular, some serious competition has now arisen for Spark (Kafka Streams, Apache Flink). According to Databricks, the upcoming Spark 2.3 release will catch up significantly and even clearly outperform the competition. It will be interesting to see which technology will prevail for streaming. As far as complete "end-to-end" applications are concerned – and here I would include machine learning models as well as streaming – I see Spark as unrivalled.

So I say goodbye to you from Dublin as I hurry off to my next lecture.

Here are some impressions of Spark Summit Europe 2017.