
Missing Values in Logistic Regression

Flexibility of logistic regression

Alongside decision trees, logistic regression is the workhorse for modelling the occurrence of an event. Fortunately, both methods accept practically any kind of predictor: dichotomous categories, multi-level categories, or continuous variables on an interval scale. Logistic regression in particular, however, has no built-in way to deal sensibly with missing values. In social science or market research, analysts often make do with restricting the analysis to complete cases.

This approach always bears the risk that a customer group is systematically neglected. The mere fact that values are missing may already tell you something about a customer or about the study's findings.

When dealing with business data, the number of cases is generally large enough to include the cases with missing values in the scoring as well. For categorical predictors, one simply forms an additional category that contains these cases.
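As a minimal sketch of this recoding, assuming missing entries are stored as `None` or as an empty string: the label `"missing"`, the function name and the example data are made up for illustration.

```python
from collections import Counter

MISSING_LABEL = "missing"  # hypothetical label for the additional category


def with_missing_category(values, missing_markers=(None, "")):
    """Return a copy of values in which missing entries form their own category."""
    return [MISSING_LABEL if v in missing_markers else v for v in values]


# Made-up example: preferred contact channel per customer.
channel = ["mail", "phone", None, "mail", "", "phone", "mail"]
recoded = with_missing_category(channel)

# Every customer keeps a usable category, so no case is dropped from scoring.
counts = Counter(recoded)
print(counts)
```

The recoded variable can then be dummy-coded like any other multi-level category.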

The difficulty with continuous predictors

But how do you deal with missing values in continuous predictors? Suppose one wants to develop a churn score, uses customer age as a predictor, and finds that about one-sixth of the customers have not entered an age. In addition, one frequently discovers further customers with implausible entries, which also have to be treated as 'missing'.

If age were complete, it would enter the model in such a way that each additional year multiplies the odds of termination by a constant factor, the odds ratio. On the log-odds scale, this relation is always linear, so the modelled churn risk can only rise or fall steadily with age.
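Written out for the age example, and assuming age is the only predictor, the model looks as follows. The symbols \beta_0 (intercept) and \beta_1 (age coefficient) are introduced here purely for illustration.

```latex
\log\frac{P(\text{churn})}{1 - P(\text{churn})} = \beta_0 + \beta_1 \cdot \text{age},
\qquad
\frac{\text{odds}(\text{age} + 1)}{\text{odds}(\text{age})} = e^{\beta_1}
```

The left-hand equation is linear in age, and the right-hand one shows that every additional year multiplies the odds by the same factor. A customer without an age has no value to insert into the equation, so no score can be computed.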

The literature offers a variety of techniques for dealing with missing values, each with its own justification and field of application. Software packages such as SPSS offer the option of replacing missing values with the median or the mean. Depending on the proportion of missing values in the data set, this approach rests on assumptions that may well be wrong.
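To make the implicit assumption visible, here is a small sketch of median replacement on made-up ages, with `None` marking a missing entry. It is not a reproduction of what SPSS does internally, only an illustration of the idea.

```python
from statistics import median

# Made-up ages; None marks a missing entry (2 of 12, i.e. one-sixth).
ages = [23, 35, None, 41, 29, 52, 67, None, 38, 45, 31, 58]

# The median is computed from the observed ages only.
observed = [a for a in ages if a is not None]
fill_value = median(observed)

# Every customer without an age is now treated as if of median age.
imputed = [fill_value if a is None else a for a in ages]

share_at_median = imputed.count(fill_value) / len(imputed)
print(fill_value, share_at_median)
```

After the replacement, the model can no longer tell customers who never entered an age from customers who really are of median age.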

Best Practice: Transformation of the Continuous Predictors into Equal-Sized Discrete Steps

As best practice, I clearly recommend recoding continuous predictors with missing values in the logistic regression into a number of categories. As a first step, I would divide the variable 'age' from the example into 9 or 10 equal-sized quantile groups and treat the missing values as an additional category. As the reference category, i.e. the category that serves as the baseline, it is best to choose one at the edge, i.e. the highest or the lowest group.

If a clear monotonic relation between age and churn likelihood actually exists, it will also show up in the coefficients of the 9 or 10 individual categories. If neighbouring categories do not differ from one another, it makes sense to combine them into a larger category. This approach also has the advantage that relations with a minimum or maximum, i.e. curvilinear relations, can be detected. It might be that especially young and old customers have a high churn risk, while the middle is characterized by stability.

Irrespective of this, one can also make forecasts for the customers with missing age. If the coefficient of this category differs significantly from the other age coefficients, this strongly points to a systematic effect. In that case, one has identified a distinct customer group that would be swept under the carpet by other methods.
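The recoding described above can be sketched in plain Python. The data are made up and all function names are hypothetical. To keep the sketch self-contained, it relies on the fact that a logistic regression with an intercept and a single dummy-coded categorical predictor reproduces each category's observed churn rate, so each coefficient equals that category's log-odds minus the log-odds of the reference category. With further predictors in the model this shortcut no longer holds, and one would fit the model with a statistics package instead.

```python
import math
from bisect import bisect_left
from collections import defaultdict

MISSING = "missing"  # hypothetical label for the extra category


def quantile_edges(values, n_bins):
    """Upper cut points that split the values into n_bins roughly equal groups."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins - 1] for k in range(1, n_bins)]


def age_category(age, edges):
    """Bin index 0..len(edges) for a known age, MISSING otherwise."""
    return MISSING if age is None else bisect_left(edges, age)


def category_coefficients(categories, churned, reference):
    """Log-odds of each category minus the log-odds of the reference category."""
    events, totals = defaultdict(int), defaultdict(int)
    for cat, y in zip(categories, churned):
        totals[cat] += 1
        events[cat] += y

    def log_odds(cat):
        p = events[cat] / totals[cat]
        return math.log(p / (1 - p))  # assumes 0 < p < 1 in every category

    base = log_odds(reference)
    return {cat: log_odds(cat) - base for cat in totals if cat != reference}


# Made-up data: 100 customers per age from 20 to 69, churn rate U-shaped in age.
ages, churned = [], []
for age in range(20, 70):
    rate = 0.10 + 0.30 * abs(age - 45) / 25
    n_churn = round(100 * rate)
    ages += [age] * 100
    churned += [1] * n_churn + [0] * (100 - n_churn)

# Made-up data: 1000 customers without an age (one-sixth), half of whom churn.
ages += [None] * 1000
churned += [1] * 500 + [0] * 500

# Deciles are computed from the known ages only.
edges = quantile_edges([a for a in ages if a is not None], n_bins=10)
categories = [age_category(a, edges) for a in ages]

# Reference category: the lowest age group, i.e. a category at the edge.
coefs = category_coefficients(categories, churned, reference=0)
for cat, beta in sorted(coefs.items(), key=lambda item: str(item[0])):
    print(cat, round(beta, 3))
```

In this constructed example the middle age groups get clearly negative coefficients relative to the youngest group, the oldest group lies close to the reference again, and the 'missing' category gets a positive coefficient of its own. This is the pattern a single linear age term could not represent.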