
A Basket Full of Snakes: Python modules for data science

Anyone who knows my earlier blog posts knows that I am a big fan of both R and Python in my daily work.

As powerful as R is in terms of functionality for data analysis and modeling, motivation fades just as quickly when heavy number crunching pushes RAM to its limit.

In this situation, a nice server installation with plenty of metal (e.g. 96 GB of RAM) works wonders.

As this option is not always available, I made a virtue of necessity and turned to the more performant Python-based alternatives to R, especially since I have been using Python for ETL and data preparation for a long time.

Python-based alternatives

Easier said than done. As so often in the open-source world, a whole new ecosystem of tools reveals itself, with a rich, exotic and international flora of code plants. At first glance it was not that easy to clearly tell apart the abbreviations of the individual tools and their respective focuses. In addition, I realized very soon that the motivations of the developers and the professional backgrounds of the early users clearly differ from those of SPSS or SAS jockeys: the focus is plainly on machine learning. What I mean by that, for example:

  • Clearly technical orientation: more opportunities to manipulate the data at a low level yourself. The proximity to the underlying C implementation is noticeable, and the explicit choice of data types is supported and once again pays off.
  • High acceptance of black-box procedures (in scikit-learn): the end justifies the means, and if scoring with exotic procedures yields better performance, that is fine even if the interaction of the predictors can no longer be illustrated in a simple manner.
  • Brute force gets you to the target: both the out-of-the-box support for parallelization and the grid-search procedures for parameter optimization quickly tempt you to simply let the CPUs glow :-) (see the sketch after this list).
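
To illustrate that brute-force spirit, here is a minimal sketch of a parallelized grid search in scikit-learn; the toy data and the parameter grid are made up purely for demonstration:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Made-up toy data
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Hypothetical parameter grid; every combination is evaluated via cross-validation
    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}

    # n_jobs=-1 spreads the work over all available CPU cores - they will glow
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5, n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)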

Machine learning in Python has no monopoly on these traits, as they can also apply to work in R, even if corresponding extensions are frequently required. It is rather my subjective impression, which arises when I look at the documentation, the tutorials and the focus of the functionality. Even a glance at the cheat sheet for scikit-learn algorithms suggests, at first sight, that there are no conventional regression methods at all. The leap to regularized methods is made directly, and instead of logistic regression, SVMs are suggested for a binary classification problem.
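
To make that concrete, here is a minimal sketch of the route the cheat sheet suggests, an SVM for a binary classification problem; the data is made up:

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Made-up binary classification data
    X, y = make_classification(n_samples=500, n_features=10, random_state=1)

    # The cheat sheet's suggestion: a support vector machine instead of logistic regression
    clf = SVC(kernel="rbf")
    clf.fit(X, y)
    print(clf.score(X, y))  # accuracy on the training data, just to show the call pattern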

Even if this whole hotchpotch of Python modules seems somewhat confusing and inaccessible - a basket full of snakes - I see enormous potential in this technology. It enables very performant, formally stringent, highly automatable and easily deployable machine learning applications that scale up to big applications.

Overview of Packages

The following list is designed to give beginners an initial guide through the jungle of abbreviations:

NumPy

NumPy provides highly performant data structures. It excels at processing data in very large vectors and matrices. The core routines are implemented in C and Fortran, which is one reason for the good performance.
NumPy also offers facilities for mathematical calculations; however, these are not its primary purpose.

NumPy's data types and data structures form the basis for all further Python technologies listed here. 
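
A minimal sketch of what that looks like in practice; the numbers are arbitrary:

    import numpy as np

    # Explicit choice of data type pays off: float32 needs half the memory of float64
    x = np.arange(1000000, dtype=np.float32)

    # Vectorized operations run in compiled C code instead of a Python loop
    y = np.sqrt(x) + 2.0 * x
    print(y.dtype, y.shape)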

SciPy

SciPy builds on NumPy and extends it with scientific computing capabilities. So to speak, SciPy with NumPy underneath can be seen as a small Python alternative to Matlab.
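
A tiny sketch of that scientific-computing flavor; the objective function below is made up:

    from scipy import optimize

    # Made-up objective: a shifted parabola with its minimum at x = 3
    def f(x):
        return (x - 3.0) ** 2 + 1.0

    result = optimize.minimize_scalar(f)
    print(result.x)  # approximately 3.0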

SciPy is rather uninteresting for everyday analysis or the area of machine learning.

pandas

Pandas also builds on NumPy and improves the usability of data transformation and analysis.

NumPy structures are packed into objects that are quite similar to R's data types. Thus, there are Series (R vectors) and DataFrames. Subsetting/filtering, computing new variables, aggregating and joining data frames work fairly intuitively and are clearly inspired by R in terms of usability.

Pandas objects further simplify the work by providing methods for descriptive statistics and visualization. The standard tools for an exploratory data analysis are thus included right away. Given the wealth of built-in methods, it is worth browsing the documentation frequently.
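
A small sketch of those R-like idioms on a made-up data frame:

    import pandas as pd

    # Made-up example data
    df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                       "value": [1.0, 2.0, 3.0, 4.0]})

    df["double"] = df["value"] * 2               # computing a new variable
    subset = df[df["value"] > 1.5]               # subsetting/filtering
    means = df.groupby("group")["value"].mean()  # aggregation

    # Descriptive statistics out of the box
    print(df.describe())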

scikit-learn

This package is Python's central machine learning library. It offers many modern procedures for every area of machine learning. Since every kind of model is addressed through the same interface, you can test various models against each other with generic code much more easily than in R. The included set of standard tools for variable transformations is also remarkable.
The downside is that scikit-learn only models and does not do much more: there is no model summary, there are no charts, and even some of the simplest diagnostics, such as residuals, have to be calculated by hand.
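
A minimal sketch of that uniform interface, swapping model families with otherwise identical code; the data and the chosen models are arbitrary:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)

    # The same generic code works for completely different model families
    for model in (LogisticRegression(max_iter=1000),
                  RandomForestClassifier(random_state=0)):
        scores = cross_val_score(model, X, y, cv=5)
        print(type(model).__name__, scores.mean())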

The name scikit derives from "SciPy Toolkit": it was developed on top of SciPy, since SciPy itself does not include the relevant methods.

statsmodels

Statsmodels is a fairly new alternative to scikit-learn that is much more inspired by R and more focused on classic modeling and data mining. After fitting a model, you get a nice summary with a useful selection of key figures. The model object also contains many reusable quantities, just as one is familiar with from R. There is even an interface that accepts model formulas in the same notation as R.
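
A small sketch of that R-style formula interface on made-up data:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Made-up example data
    df = pd.DataFrame({"y": [1.0, 2.1, 2.9, 4.2, 5.1],
                       "x": [1, 2, 3, 4, 5]})

    # Formula notation just like lm(y ~ x) in R
    model = smf.ols("y ~ x", data=df).fit()
    print(model.summary())  # the familiar regression summary
    print(model.params)     # reusable results on the model object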

IPython

First of all, IPython is an interactive command line for Python. Its intelligent auto-complete dramatically facilitates the exploration of data. However, the console should not automatically be equated with the IPython Notebook. The notebook lifts the IPython console into a web-based working environment that includes many usability features, displays charts and output inline while you work on the code, and provides publishing options in the spirit of "reproducible research". So whoever is looking for an alternative to R's knitr inevitably arrives at the IPython Notebook.
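
A small taste of the interactive console; the session below is a made-up sketch using IPython's built-in ? help and the %timeit magic:

    In [1]: import numpy as np

    # Appending '?' displays docstring and signature inline:
    In [2]: np.median?

    # The %timeit magic gives a quick micro-benchmark of an expression:
    In [3]: %timeit np.median(np.random.rand(1000))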

matplotlib

Machine learning in Python only becomes well-rounded with chic visualizations, and here matplotlib steps into the breach. The graphics library, which was originally developed independently of the aforementioned technologies, enables any conceivable visualization, since every object in a chart, however small, remains selectable and modifiable, given sufficient skill on the developer's part. At the same time, it offers enough high-level functions for beginners to quickly create their first publication-quality charts.
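
A minimal sketch of both levels, the quick high-level plot and the low-level access to individual objects; the data is made up:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(0, 10, 100)

    # High level: a few calls produce a presentable chart
    fig, ax = plt.subplots()
    line, = ax.plot(x, np.sin(x), label="sin(x)")
    ax.set_title("A first chart")
    ax.legend()

    # Low level: every object remains selectable and modifiable
    line.set_linewidth(3)
    ax.spines["top"].set_visible(False)

    plt.show()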

Production Stage

The hype around machine learning in Python probably took off properly together with the buzzword "data science". However, that should not create the impression that the technologies described here are brand new. On the contrary, the roots of most of these modules, with the exception of Statsmodels, go back much further; NumPy, for example, can look back on 20 years of history. I myself was already using matplotlib productively for presentation charts about seven years ago. By now, one can say that these technologies have reached a reliable level of maturity for professional users. NumPy and pandas are, among other places, also used in finance, where resilience is definitely no mere "nice to have". My qualifier "professional users" is justified by the fact that one is still confronted fairly frequently with uninformative error messages, and a certain level of experience is very helpful for working around such problems.

Beginning

Difficult Setup

If one wants to try these technologies, one will quite certainly fail at the initial installation of the packages, at least under Windows. Usually, a command such as "pip install sklearn" suffices on a current Python installation. Here, however, it aborts unless a number of compilers, e.g. Visual Studio, are installed. Unless one has a computer science background or otherwise needs such compilers, one does not want to clutter one's system with various compilers. An alternative, which I used, is to install pre-compiled unofficial binaries.

Anaconda

By contrast, the best way to begin is probably to install an enhanced Python distribution such as Anaconda. It includes all the aforementioned packages and many more. It is somewhat more extensive than a normal Python installation, but its scope is tailored to data analysis, and there are even enterprise versions available from Continuum Analytics.

Outlook

Of course, this overview is only a subjective view without any claim to completeness. Apart from that, the Python community develops so rapidly that this post will be out of date tomorrow.

Nevertheless, I hope this post has helped interested readers find their way through the jungle of terminology.