Apache Spark - 5 tips for a successful start

Apache Spark will be 10 years old this year. Started in 2009 as a project at the University of California, Berkeley, Spark has become the leading framework for distributed computing on clusters, and its popularity is still growing. The reason for this is its versatility: everything is conceivable and possible, from a local installation on a laptop to deployment on powerful clusters of several thousand nodes, from the implementation of ETL processes to real-time AI applications. More than 1,000 developers and an active community help make Spark useful in ever more areas of application.

How do you approach this colossus, and how can a beginner learn Spark? I started from scratch and have familiarized myself with Spark over the past few weeks. In this post, I would like to share my experiences.

Set a goal

Apache Spark is very extensive, and it is therefore easy to get lost in it. To prevent this, I set myself a specific goal, an approach which has proven itself in my experience. This makes it easier to retain focus and avoid being distracted by other topics (which, of course, are also interesting).

I set myself the goal of obtaining the Spark certification by Databricks. Apart from a new entry in my CV, this has the advantage of forcing intensive preparation for the exam, in the course of which one learns a lot and concentrates on the required content. That the exam demands knowledge of certain topics of little practical relevance (e.g. RDDs) was not a disadvantage from my point of view: it lays the foundation for a better understanding of the system.

Certification need not be the goal; other possibilities include a specific project to be implemented with Spark, or working through a textbook. What matters is that the goal provides orientation, so that you retain focus despite the seemingly endless possibilities which Spark offers.

Install the learning environment

To learn Spark (or any programming language), you have to use it. Programming is learned only by doing. Therefore, the first step is to install Spark. The easiest way is a local installation; Spark can be downloaded for free from the download page and installed.
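
As a quick sanity check after installation, here is a minimal sketch, assuming PySpark was installed via pip (e.g. `pip install pyspark`), that starts a local session and prints the version:

```python
# Minimal smoke test for a local Spark installation; assumes PySpark
# is available, e.g. after `pip install pyspark`.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")   # run locally, using all available cores
         .appName("install-check")
         .getOrCreate())

print(spark.version)           # prints the installed Spark version
spark.stop()
```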

An important point is the choice of programming language and development environment. Spark provides APIs for Scala (the language Spark itself is written in), Python, R, and Java. I recommend choosing a language you are already familiar with. Otherwise you will have to learn a programming language and Spark simultaneously, which in my experience is not a good idea: what is language-specific and what is Spark-specific is hard to distinguish for a beginner who does not yet know either.

For this reason, I decided to prepare for the Databricks certification with Python. As a development environment I use Jupyter Notebook, which conveniently allows program code, comments, and notes to be stored in one place.
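
In a notebook, getting started looks roughly like the following sketch; the session, once created, is reused across cells, and the sample data is made up:

```python
# A typical first cell in a Jupyter notebook: create (or reuse) a
# SparkSession, then build a tiny DataFrame to confirm everything works.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("learning-spark")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
```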

Work through the basic course

Now it's time to start learning. In my view, there are two options: either you work through a basic course (one which matches your chosen goal; see above), or you use the Spark documentation. The latter contains the programming guides, which explain the core components of Spark (RDDs, structured APIs, streaming, machine learning, graph processing) in the style of a textbook. All code examples are available in Scala, Java, and Python; SparkR (R on Spark) is treated in a separate chapter. The formal API documentation for Scala, Java, Python, and R is, of course, also available.

I opted for the first method and worked through the book Spark: The Definitive Guide by Bill Chambers and Matei Zaharia. Not surprisingly, there is a large selection of books on the fundamentals of Apache Spark; I chose this one because it is written by leading developers of Spark, who also founded Databricks, the provider of the certification I was aiming for.

Whether you learn with a book, videos, or e-learning is primarily a matter of personal preference. Some people learn better with video sessions; I turned out to prefer books. For me, their big advantage is that I learn at my own pace and can quickly look things up again (especially when using e-books).

I think it is important to begin with the basics before proceeding to specialized topics. A good start is "A Gentle Introduction to Apache Spark", the first part of Spark: The Definitive Guide, which is freely available. After this overview, I highly recommend a deeper look into the structured APIs (DataFrames, Spark SQL); they are the basis for all subsequent, specialized topics. The low-level APIs (RDDs) are less important in practice today and, in my opinion, need not be studied in depth, unless the Databricks certification is your goal: there, RDDs are still examined in detail.
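
To give a flavor of the structured APIs, here is a small sketch (with made-up employee data) expressing the same aggregation once via the DataFrame API and once via Spark SQL:

```python
# The same aggregation expressed twice: once with the DataFrame API,
# once with Spark SQL. The employee data is made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured-apis").getOrCreate()

df = spark.createDataFrame(
    [("Alice", "sales", 3000), ("Bob", "sales", 4000), ("Carol", "hr", 3500)],
    ["name", "dept", "salary"])

# DataFrame API
df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()

# Equivalent Spark SQL: register a temporary view, then query it
df.createOrReplaceTempView("employees")
spark.sql(
    "SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept"
).show()
```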

Delve into special topics

After studying the fundamentals, the specialized topics come next: machine learning, structured streaming, or graph processing, depending on your goal (see above). All these topics are treated in varying degrees of detail in the book by Chambers and Zaharia.

For these topics, I have found it worthwhile to refer to the programming guides in the Spark documentation. They provide excellent, pleasantly concise explanations, with corresponding code examples in Python, Java, and Scala.
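
As a taste of the style of these guides, here is the classic Structured Streaming word count, close to the version in the programming guide (it expects a local socket source, which you can start with `nc -lk 9999`):

```python
# Structured Streaming word count: read lines from a local socket and
# print running word counts to the console.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-word-count").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")   # emit the full updated result table
         .format("console")
         .start())
query.awaitTermination()
```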

Develop your own project

After you've acquired the necessary knowledge, it is time to put it into practice. While reading books or taking online courses, you do already write a lot of code, but usually it is not your own.

Choose a specific assignment. This can be a machine-learning problem (interesting data sets are available here and here), or producing a report by preparing and transforming database tables. It is best to work on an assignment related to your own work.
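
A report-style mini-project might look like the following sketch; the file name and columns are hypothetical:

```python
# A hypothetical mini-project: read raw order data from a CSV file,
# keep completed orders, aggregate per customer, and write the result
# as Parquet. File name and columns are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-report").getOrCreate()

orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

report = (orders
          .filter(F.col("status") == "completed")
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_amount"),
               F.count("*").alias("order_count")))

report.write.mode("overwrite").parquet("sales_report.parquet")
```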

Programming is learned only by doing. The more you work with Spark, the better you will get at it.

Summary

Getting started with Apache Spark is easier than beginners might expect. A local installation on your own laptop is possible without much effort, and there are good textbooks and tutorials, some of them freely available. Most important is to actually work with Spark, ideally starting with a small project of your own.
