The second day of Predictive Analytics World: burglars, sensitive data, and what they teach us about data science storytelling
The second day of Predictive Analytics World was even better than the first. The opening presentation was already something special: it dealt with predictive policing. The speaker, Thomas Schweer, is a genuine criminologist and an expert on organized crime and terrorism. Predictive policing, however, is better suited to high-volume crime, so the presentation focused on burglaries.
Predictive Analytics World Berlin 2015
The pattern that makes forecasts possible here is the so-called ‘near-repeat phenomenon’: after a burglary, there is a high probability of another burglary in the immediate neighborhood within the next 72 hours. There are indications that these series can be traced back to professional burglary gangs. However, at approximately 15%, the clearance rate is so low that this presumption cannot be conclusively verified.
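To make the near-repeat idea concrete, here is a minimal sketch of such a rule in Python. It is not the system shown in the talk. The 72-hour window comes from the presentation, while the 400 m radius, the local metric coordinates, and all names (`Burglary`, `in_near_repeat_zone`) are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import hypot


@dataclass(frozen=True)
class Burglary:
    x_m: float      # easting in metres on a hypothetical local grid
    y_m: float      # northing in metres on the same grid
    when: datetime  # time of the offence


WINDOW = timedelta(hours=72)  # window mentioned in the talk
RADIUS_M = 400.0              # assumption: the talk gave no radius


def in_near_repeat_zone(x_m, y_m, now, history,
                        radius_m=RADIUS_M, window=WINDOW):
    """Return True if a past burglary happened within `radius_m`
    of (x_m, y_m) and no longer than `window` before `now`."""
    return any(
        timedelta(0) <= now - b.when <= window
        and hypot(x_m - b.x_m, y_m - b.y_m) <= radius_m
        for b in history
    )


history = [Burglary(0.0, 0.0, datetime(2015, 11, 10, 20, 0))]
now = datetime(2015, 11, 12, 8, 0)  # 36 hours after the burglary

nearby = in_near_repeat_zone(100.0, 100.0, now, history)
far_away = in_near_repeat_zone(1000.0, 0.0, now, history)
too_late = in_near_repeat_zone(100.0, 100.0, datetime(2015, 11, 15, 8, 0), history)
```

A real system would presumably weight risk by distance and elapsed time rather than use hard cut-offs; the binary rule only illustrates the principle.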
What made the lecture so interesting and entertaining was not the methodological detail; the forecasting algorithm behind the presented system is, as one may assume, rather simple. It was the insight into the criminology of burglary and the cooperation with the police. One learned a lot about how professionals operate (‘only take what fits into a sock’) and about the prospects of success in burglary prevention (‘sometimes there are even arrests’). The skewed distributions well known from the business context appear again: it is assumed that 4% of the offenders commit 40% of the burglaries. The presentation was valuable not only for its entertainment factor, however, but also because it implicitly reminds us of one important fact: domain knowledge is just as important as statistical and algorithmic knowledge.
Predictive methods for payroll outsourcing
The subsequent presentation by Phillip O’Brien was a great example of how a business experienced in predictive methods can put them to proper use, carefully and tactfully, even in a very sensitive area. The business is Paychex, an international provider of payroll outsourcing and related services, and the sensitive area is a churn score. It is sensitive because it concerned terminations not by customers, but by employees.
Caution and tact were required first of all in selecting the predictors. All predictors that carried a risk of discrimination were excluded from the outset: age, gender, nationality, and so on. A further highly sensitive class of potential predictors was also excluded from the modeling: all information pertaining to income. On the one hand, this simplified the approval of the raw data for use in the model. On the other hand, it removed the risk that the model would be misused to support the personal interests of individuals, which are always strong where wages are concerned.
The modeling itself was an unspectacular logistic regression. For use in the business, only five score classes were formed (not ten or twenty, as elsewhere), in the style of Anglo-American grades from A (best, here: lowest termination risk) to F (worst, here: highest termination risk).
Classification into score categories
In this case, however, the individual-level results never left the data science department. Only information aggregated by regional organizational unit was used, which was a further prudent decision. Phillip O’Brien brilliantly explained why it would not have been a good idea to attach a score from A to F to each employee’s name: “We feared that people might tell themselves stories about their score.” He also gave examples of the kind of stories these could be: “I scored an A. Now my boss thinks that he/she has me for certain anyway and thus does not take proper care of me anymore.” “I got an E, but I want to stay! Now I will get a kind of attention I do not want.” “I have a C. That is meaningless mediocrity. Now all the others are more important: the loyal employees with the As and Bs that one can rely on, and the dissatisfied employees with the Es and Fs who need to be taken care of.”
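The two steps described above, mapping a predicted termination probability to one of five classes and reporting only per-unit aggregates, can be sketched as follows. This is an illustration under stated assumptions, not Paychex’s implementation: the equal-width thresholds, the intermediate grade labels (the talk only fixed the endpoints A and F), and the names `score_class` and `aggregate_by_unit` are all hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean

# Assumption: five grades in the Anglo-American style; only the
# endpoints A (lowest risk) and F (highest risk) come from the talk.
LABELS = ("A", "B", "C", "D", "F")


def score_class(p, labels=LABELS):
    """Map a termination probability in [0, 1] to a grade using
    equal-width bins (the bin boundaries are an assumption)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    index = min(int(p * len(labels)), len(labels) - 1)
    return labels[index]


def aggregate_by_unit(records):
    """Turn (unit, probability) pairs into per-unit summaries so that
    no individual score appears in the result."""
    by_unit = defaultdict(list)
    for unit, p in records:
        by_unit[unit].append(p)
    return {
        unit: {
            "n": len(ps),
            "mean_risk": mean(ps),
            "classes": Counter(score_class(p) for p in ps),
        }
        for unit, ps in by_unit.items()
    }


# Made-up example data: (organizational unit, predicted probability)
records = [("North", 0.05), ("North", 0.15), ("South", 0.90)]
summary = aggregate_by_unit(records)
```

Units with very few employees would still reveal individual scores, so a practical version would likely suppress results for groups below some minimum size.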
Storytelling as a Data Scientist
Of course, this selection of examples focuses on the difficult stories and disregards the more positive ones; given the sensitivity of the topic, this focus on risk prevention is appropriate. Even more interesting is what the example teaches about why storytelling is an integral part of data science and not just a tacked-on sales point. If we, the data scientists, do not tell a story that embeds the results in the right context and guides their interpretation, that does not mean there will be no stories. Instead, the recipients of our work will tell themselves a story, as Phillip O’Brien put it so aptly. The problem is that recipients usually do not know the context the data comes from, the background one needs in order to arrive at reasonable interpretations. So they will make up a background, and unfortunately it will say more about them, their experiences and their history, than about the data. As a consequence, a large part of our effort is wasted: our results are seen, but they are interpreted in the wrong context. Phillip O’Brien did not draw this connection to storytelling explicitly, but his exciting presentation suggested it.
Dr. Michael Allgöwer during his presentation on Customer Lifetime Value.