**Neural networks for tabular data: ensemble learning without trees**

**Neural networks are applied to just about any kind of data (images, audio, text, video, graphs, ...). Only for tabular data are tree-based ensembles like random forests and gradient-boosted trees still much more popular. If you want to replace these successful classics with neural networks, ensemble learning may still be a key idea. This blog post tells you why. It is complemented by a notebook in which you can follow the practical details.**

**Tabular data: gradient boosting rules**

In 2017, Günter Klambauer and co-authors introduced self-normalizing networks with the explicit aim of making neural networks fit for tabular data. Their approach was to improve dense feed-forward networks, the standard neural network architecture for tabular data, by using a new activation function. With this modification, dense feed-forward networks can be trained that are much deeper than the previous limit of about four hidden layers. While self-normalizing networks work well and achieve results comparable with gradient boosting and random forests, they have not worked well enough to break the dominance of classical ensemble methods on tabular data – at least not yet. One reason for this is probably that self-normalizing networks do not provide a very rich or versatile theory for tailoring the right network to a particular application. They merely enable us to go deeper than before (which, of course, is a large step forward anyway).

**A gem from AI winter**

Last year I gave a conference talk on self-normalizing neural networks. I was trying to popularize the idea that the time is right for tabular data modelling experts to add neural network techniques to their toolbox. I am still convinced that this is true. But only one day after I gave that talk, I had the honour and the pleasure to attend a small, intense workshop with Hans Georg Zimmermann. During this workshop, I realized that I had been looking at neural networks in a quite superficial way, lacking much of the mathematical depth and sophistication of the theories presented in the workshop. I also learned that there are much older methods for attacking tabular data than self-normalizing neural networks, methods whose development started already during AI winter. And in this case, oldies may be goldies.

It seems to me that these methods have not yet received the attention they deserve. It is questionable whether I will be able to change that with a blog post for a probably quite limited audience, but I will do my best. I will try to explain some first steps; there is more to this than fits in a single blog post. All good ideas presented here I learned from Hans Georg Zimmermann; all mistakes and faults in this text are mine. In particular, the use of embeddings below is, for better or worse, my own variant of these methods.

**Overparametrization: bug or feature?**

Although many of the details of why neural networks work so well are yet to be understood, it is becoming more and more clear that they work because they are heavily overparametrized, not despite it. To understand that, you must not look at them through the eyes of classical statistics, which would tell you that typical neural network models have far too many parameters. This is a problem indeed, but we will deal with it later. For the moment, look at the training process through the lens of optimization theory. If you do, you will realize that these seemingly superfluous parameters protect you (or rather the training process) from getting trapped in a local minimum. And that remedies a big worry: in deep learning, none of the convergence guarantees from classical optimization theory apply. Getting trapped in a local minimum would be sort of a default scenario if we didn't have overparametrization.

Now how about the problematic aspects of overparametrization? There will be not only one optimum weight configuration for our neural network, but lots of very different ones. And we can't know which is the right one. Ensemble learning gives us a very pragmatic way out of this dilemma: Just take a few copies of your neural network, each endowed with a different optimum weight solution, and average over them. This effectively reduces the ridiculous amount of variance generated by overparametrization. You can encode all of this in a single neural network: Take identical copies of a neural network (the weak learners) and connect each of them to the input data and to an averaging neuron as the output. When you train this network, the different random initializations of the weak learners will make sure that they all arrive at different optimum weights when the training is over.
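The whole construction can be sketched in a few lines of NumPy. This is only a toy forward pass with untrained random weights (the layer sizes and the `tanh` activation are arbitrary choices for illustration, not the notebook's implementation), but it shows the structure: several identically shaped weak learners, each connected to the input, averaged at the output.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_learner(n_in, n_hidden):
    # Each weak learner gets its own random initialization, which is
    # what would drive the copies to different optima during training.
    return {
        "W1": rng.normal(scale=0.5, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.5, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }

def learner_forward(p, X):
    h = np.tanh(X @ p["W1"] + p["b1"])   # single hidden layer
    return h @ p["W2"] + p["b2"]         # linear regression output

def ensemble_forward(learners, X):
    # The "averaging neuron": the mean over all weak learners' outputs.
    return np.mean([learner_forward(p, X) for p in learners], axis=0)

learners = [init_learner(n_in=5, n_hidden=10) for _ in range(10)]
X = rng.normal(size=(4, 5))
y_hat = ensemble_forward(learners, X)
print(y_hat.shape)  # (4, 1)
```

In a real framework you would build this as one computation graph, so that a single training loop fits all copies at once.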

**A complete recipe for sizing your neural network architecture**

If you follow this argument, a neural network architecture that averages over identical copies of smaller neural networks (weak learners) is not as silly as it may look at first sight. I like to call this kind of averaging architecture a neural averaging ensemble. We haven't talked about the weak learners yet, the building blocks of these architectures. What kind of network should we choose? Well, whatever you choose, you will have several copies of it afterwards, so it may be a good idea to keep things small and simple. The Universal Approximation Theorem tells us just how small and simple it can get: a single hidden layer will suffice. Of course, networks with more layers are often far superior in practice. But let's stick to this minimalistic approach for the moment. If we do, the only thing that remains to decide before we have completely specified our weak learner is the size of the single hidden layer. How many neurons do we need?

The answer is surprisingly easy: We fit a single weak learner on the training data (and for speed's sake, we don't need to fit it perfectly; a rather low number of epochs will suffice). We start with a low number of neurons in the hidden layer, say, 10. After training the network, we check whether any of these neurons is superfluous, i.e., whether its output is highly correlated with that of another neuron. If not, we retry with a higher number (say, 20) until we find two highly correlated neurons. In this way we make sure that the weak learner is overparametrized, but not ridiculously so. In practice, you may want to repeat this procedure with several different initializations of your random number generator, to make sure your choice is robust. If in doubt, use the higher number of neurons.
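The correlation check itself is simple to implement once you have a matrix of hidden-layer activations (one row per training sample, one column per neuron). Here is one possible way to code it; the 0.95 threshold is my own arbitrary choice, not part of the method as I learned it, and the activations below are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

def max_pairwise_correlation(H):
    # H: (n_samples, n_neurons) matrix of hidden-layer activations.
    C = np.corrcoef(H, rowvar=False)
    np.fill_diagonal(C, 0.0)          # ignore each neuron's self-correlation
    return np.max(np.abs(C))

def has_superfluous_neuron(H, threshold=0.95):
    # Proxy for "a neuron is superfluous": some pair of neurons
    # produces highly correlated outputs.
    return max_pairwise_correlation(H) >= threshold

# Toy activations: 10 neurons, where neuron 9 is an almost exact
# copy of neuron 0 -- the situation we are probing for.
H = rng.normal(size=(200, 10))
H[:, 9] = H[:, 0] + 0.01 * rng.normal(size=200)
print(has_superfluous_neuron(H))  # True
```

If the check comes back `False`, you would grow the hidden layer and train again.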

The only thing that remains for a complete sizing of our architecture is the number of copies (of the weak learner) we want. If you’re a person who likes to be in control, you could choose the number of copies by using an analogous approach to the one we used above for the hidden layer: Train an ensemble with a certain number of copies, then try larger ones step by step until you have a few copies whose output is highly correlated. I encourage you to try that, and I’d be happy to hear about your results.

However, the number of weak learners in an ensemble has three properties that will enable us to take a more laid-back approach to choosing it:

- More weak learners make a better model (or at least one that is not worse).
- Model quality as a function of the number of weak learners has a nice asymptotic behaviour: adding more learners gives you less and less benefit the more you already have.
- The parameter is quite robust: choosing the "wrong" number of copies won't ruin your model.

In the light of these amenable properties, let’s take a shortcut: Choose a low number of copies (like 10) for your first experiments, and go up to something between 50 and 100 if you want to finalize things. This will usually work well. If you encounter situations in which this is not the case, I’m eager to hear about it.
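These properties can be illustrated with a toy simulation. Modelling the weak learners as unbiased predictors with independent noise is a simplifying assumption (real learners' errors are correlated), but it makes the diminishing returns of averaging visible:

```python
import numpy as np

rng = np.random.default_rng(2)

true_value = 1.0
# 100 simulated weak learners, each an unbiased predictor with
# independent noise; the noise stands in for the variance that
# overparametrization induces. 10_000 repetitions per learner.
predictions = true_value + 0.5 * rng.normal(size=(100, 10_000))

for k in (1, 10, 50, 100):
    ensemble = predictions[:k].mean(axis=0)   # average the first k learners
    mse = np.mean((ensemble - true_value) ** 2)
    print(f"{k:3d} learners: MSE ~ {mse:.4f}")
```

Under these assumptions the MSE shrinks like 1/k, so the step from 1 to 10 learners helps far more than the step from 50 to 100, which matches the laid-back sizing advice above.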

**Categorical complications**

All of the above works well as long as we restrict ourselves to regression problems on purely numerical inputs. With categorical inputs or classification problems (i.e., categorical outputs), we need a few modifications to make things work in practice.

For classification, the required modification is just the usual softmax layer (on top of each weak learner) to produce probabilities for each class. If you have only two classes, you can instead use a sigmoid layer with a single neuron that estimates the probability of one class. That is also the way I implemented it in the notebook.
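For reference, the two output heads amount to nothing more than these standard functions (plain NumPy versions; any framework provides them built in):

```python
import numpy as np

def sigmoid(z):
    # Binary case: a single neuron estimating P(class 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Multi-class case; subtract the row max for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, -1.0, 0.5]])  # one sample, three classes
probs = softmax(logits)                # rows sum to one (up to float error)

p = sigmoid(np.array([0.0]))
print(p)  # [0.5]
```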

Categorical inputs could be fed into the network by using one-hot encoding. This approach has well-known drawbacks (above all, a skyrocketing input dimension). Additionally, it often leads to huge weak learners when you use the above method of sizing. Fortunately, there is also a well-known remedy to this problem: Embedding layers.

For each categorical input, we add an embedding to our weak learner right after the input layer. When preparing the data for the embedding layer, we must replace each category with a unique integer between 0 and the number of categories minus 1. Instead of giving each weak learner its own embedding layer, we can also let them use one shared embedding layer; the notebook enables you to experiment with both options. My impression is that usually a shared embedding layer works fine, so there is no need to bloat the model with separate embeddings.
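Stripped of the framework machinery, the integer encoding and the (shared) embedding lookup boil down to a table lookup. The three-colour categorical variable and the output dimension of 2 below are made-up illustrations; in a real network the table entries would be trainable weights:

```python
import numpy as np

rng = np.random.default_rng(3)

# Map each category to a unique integer in [0, n_categories).
categories = ["red", "green", "blue"]
cat_to_idx = {c: i for i, c in enumerate(categories)}

# A shared embedding table: one row per category, `dim` columns.
embedding = rng.normal(size=(len(categories), 2))

def embed(values):
    idx = np.array([cat_to_idx[v] for v in values])
    return embedding[idx]   # plain row lookup; gradients flow into the rows

X_cat = ["blue", "red", "blue"]
vectors = embed(X_cat)
print(vectors.shape)  # (3, 2)
```

Sharing one table means every weak learner sees the same vector for, say, "blue", whereas separate embeddings would let each learner develop its own representation at the cost of many more parameters.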

Embeddings come at the price of additional hyperparameters: You must choose the dimensionality of the output for each of them. There are several approaches to do that: You can use randomized search together with the other parameters, you can derive the output dimension from the number of categories or you can try to just guess a number based on your intuition and some ideas on the interpretation of the data you’re using. I went for the last approach. Choosing the output dimensions is an interesting fine point, and I’m eager to hear how you deal with that and how your approach works out in practice.

**As good as gradient boosted trees**

When comparing the performance of this approach to gradient boosted trees within the length of a blog article, we are forced to take a less-than-scientific approach. To demonstrate that these methods work well in the middle of gradient boosting's home turf, we choose a very small dataset, and one that contains lots of categorical variables. The latter is important, because tree-based methods usually shine on categorical data. Fortunately, Szilard Pafka has done a lot of gradient boosting benchmarks based on a dataset with many categorical variables. He has tested different implementations on various subsampled versions of the same dataset. The smallest one has only 10K rows, and it's the one we will be using here, just to fight the myth of deep learning always requiring lots of data. The best test AUC in Szilard's benchmark for the 10K dataset is 70.3 (achieved by XGBoost). Note that Szilard didn't do an extensive grid search on the hyperparameters, but a smaller, more exploratory one. As we did even less hyperparameter tuning, comparing our results to his does not give deep learning methods any unfair advantage. When you look at the notebook, you will see that we achieve 69.5 with the neural network approach, so we're roughly on the same level of performance. This optimum performance for the network is already achieved after 4 (!) training epochs, so training will even be faster than with gradient boosting.

**Try this at home!**

I hope this demonstration has whetted your appetite. Of course, this is only a single example, with a single dataset. As we all know, this doesn't prove much. So please, try this on your own data and tell me how it goes!

There is also much more where this came from: we have only scratched the surface of a rich theory. The techniques here can be extended in several directions, and we can even extend the analogy to classical ensemble learning further by going from averaging to boosting, producing even more accurate models. There are also specialized techniques for model transparency. I may cover some of these topics in future posts. In the meantime, please try this stuff for yourself! I'm eager to hear about your experiences.