Predictive Analytics World, Day Two

The second day of Predictive Analytics World began with a highlight right away. Phil Winters's presentation was the result of a bet with one of the conference organizers, who had wagered that Winters could not possibly give a presentation about the Internet of Things. Well, the organizer lost this bet big time.

Internet of Things

The challenge was considerable, however. Phil had set himself the goal of giving a presentation whose methods and data could be traced in detail, so the data had to be publicly available. Suitable publicly accessible data on the topic "Internet of Things", i.e. sensor data, is very difficult to find. After weeks of research, he finally found some in Washington, D.C.: a bike-sharing scheme which works similarly to the service offered by Deutsche Bahn in some major German cities. The scheme is financed by the District of Columbia, and its contract requires the operator to post the accruing data on the internet for unrestricted public access. The data comes from sensors at the bikes and the hire points which register each hire and return. Each individual hire can thus be traced, including its starting and final points. Here is the link for those who would like to play around with this data themselves: http://www.capitalbikeshare.com/system-data

Anyone who wants to do so, however, is well advised to take Phil's warning regarding sensor data to heart. One might think that such data, being automatically generated, would be available in nicely uniform, tidy formats. Anyone who believes that had better think again: the operator has been changing the data format for unknown reasons at unpredictable intervals. Thus, the annoying data cleansing takes as much time as it generally does.
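To give an idea of what such cleansing can look like, here is a minimal sketch in Python. It assumes that the format changes show up as renamed CSV columns; the header variants, the canonical names and the sample rows below are made up for illustration and are not taken from the actual files or from Phil's workflow.

```python
import csv
import io

# Hypothetical header variants, each mapped to one canonical column name.
ALIASES = {
    "start date": "start_time",
    "started_at": "start_time",
    "end date": "end_time",
    "ended_at": "end_time",
    "start station": "start_station",
    "start_station_name": "start_station",
    "end station": "end_station",
    "end_station_name": "end_station",
}

def normalize_rows(text):
    """Yield one dict per CSV row, keyed by canonical names only."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        out = {}
        for key, value in row.items():
            if key is None:
                continue  # surplus fields without a header are dropped
            canonical = ALIASES.get(key.strip().lower())
            if canonical is not None:
                out[canonical] = value.strip()
        yield out

# Two made-up extracts with different header conventions.
old_format = (
    "Start date,End date,Start station,End station\n"
    "2012-01-01 08:00,2012-01-01 08:20,A,B\n"
)
new_format = (
    "started_at,ended_at,start_station_name,end_station_name\n"
    "2013-01-01 09:00,2013-01-01 09:10,B,C\n"
)
rows = list(normalize_rows(old_format)) + list(normalize_rows(new_format))
```

After this step both extracts share the same keys, so they can be processed together. Real files will of course throw up further surprises, such as changed date formats, which this sketch does not cover.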

The second insight regarding the sensor data is somewhat less surprising: it is rather boring without enrichment. Geographical data is the obvious candidate here; it can be acquired without hassle via Google's API and the matching KNIME nodes (there are ready-made nodes for REST requests).

And of course, the data can not only be acquired and enriched, but also used for predictions. The aim of the prediction in this case had to do with a very special rule in the operator's contract with the District of Columbia: if a station is totally empty or totally full (i.e. if either hiring or returning is impossible), the operator must remedy the situation within one hour. It would, of course, be nice to be able to predict this situation, specifically one hour ahead. This objective was achieved comfortably and without methodological contortions. Some interesting insights emerged along the way, in particular that the weather was irrelevant to the forecast. At first glance, this is surprising. It confirms, however, what I, being born and bred in Hamburg, have known for a long time: there is no bad weather, only unsuitable clothing. The cyclists in Washington seem to have a similar attitude.
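The prediction target described above can be sketched in a few lines of Python. This is only an illustration of how the label "empty or full one hour from now" might be derived from hire and return events; the function names, the station capacity and the events are hypothetical, and the model that was actually trained on such labels is not shown here.

```python
from datetime import datetime, timedelta

def occupancy_at(events, start_bikes, when):
    """Bikes docked at `when`, replaying (timestamp, delta) events up to then.

    delta is -1 for a hire and +1 for a return.
    """
    bikes = start_bikes
    for ts, delta in events:
        if ts <= when:
            bikes += delta
    return bikes

def label_one_hour_ahead(events, start_bikes, capacity, when):
    """Return 'empty', 'full' or 'ok' for the station one hour after `when`."""
    bikes = occupancy_at(events, start_bikes, when + timedelta(hours=1))
    if bikes <= 0:
        return "empty"
    if bikes >= capacity:
        return "full"
    return "ok"

# Made-up station: 10 docks, 2 bikes at 08:00, two hires and one return.
t0 = datetime(2014, 5, 1, 8, 0)
events = [
    (t0 + timedelta(minutes=10), -1),
    (t0 + timedelta(minutes=25), -1),
    (t0 + timedelta(minutes=70), +1),
]
labels = [
    label_one_hour_ahead(events, 2, 10, t0 + timedelta(minutes=m))
    for m in (0, 30)
]
```

Seen from 08:00 the station will be empty an hour later; seen from 08:30 it will not, because a bike is returned at 09:10. Labels of this kind, joined with features such as time of day or location, would form the training data for a classifier.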

Realtime Text Analytics

The second extraordinary presentation of the day followed the first immediately. It was a presentation by Baader Bank about a superb small data application which, paradoxically, won a prize as best big data project. The system automatically evaluates Bloomberg financial news and scans it for items which may trigger major price movements.

There are three reasons for this limitation to Bloomberg. Firstly, it is a source which is carefully prepared by journalists, so the danger of falling for attempts to manipulate prices is much lower than if e.g. Twitter or blogs were included. Secondly, it allows a limitation to English-language texts; these texts are, moreover, worded in a relatively simple and uniform way and thus do not trouble the natural language processing software with the many pitfalls contained in more colloquial sources. Thirdly, the limitation to Bloomberg, with a few hundred thousand news items per day, helps to keep the processing within a range of speed which may be referred to as real time. This last point, however, could also be achieved with much larger data volumes; the hardware used is not much more powerful than an upmarket desktop computer.

The results of the text analysis, including sentiment, are linked to each other in a know-how network. Here, the limited precision which any sentiment analysis is bound to have these days is elegantly cushioned: the interest is not so much in the sentiment itself as in whether the sentiment for a particular company remains stable in the relevant news item or is being destabilized. In the latter case, the traders are warned of an unstable market situation. This stability analysis takes the propagation of sentiment within the know-how network into account. If e.g. Apple sues Samsung, this also affects e.g. Samsung's competitors. As a mathematician, I was particularly impressed by the excursion into chaos theory for the stability analysis. It is refreshing to see someone use this branch of mathematics for its intended purpose, for example to differentiate chaotic from non-chaotic systems, instead of only using it as a quarry of terms which can serve to impress. In sum, the presentation was another great example of the substance and depth of the event.
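For readers who want a feel for the idea of sentiment propagation, here is a deliberately naive sketch in Python. It is not Baader Bank's method: the company relations, the weights and the scores are invented, and a simple deviation threshold stands in for the chaos-theoretic stability analysis described in the talk.

```python
# Hypothetical relations: (source, target) -> weight. A negative weight models
# a competitor, for whom bad news about the source counts as good news.
EDGES = {
    ("Samsung", "CompetitorX"): -0.5,
    ("Samsung", "SupplierY"): 0.5,
}

def propagate(sentiment, edges):
    """One propagation step: add weighted source scores to each target."""
    result = dict(sentiment)
    for (src, dst), weight in edges.items():
        result[dst] = result.get(dst, 0.0) + weight * sentiment.get(src, 0.0)
    return result

def is_destabilizing(history, new_score, threshold=0.5):
    """Flag a score that deviates from the mean of `history` by more than `threshold`."""
    if not history:
        return False
    mean = sum(history) / len(history)
    return abs(new_score - mean) > threshold

# Made-up news item: strongly negative sentiment about Samsung.
news = {"Samsung": -0.8}
after = propagate(news, EDGES)

# Made-up recent sentiment history for the supplier, mildly positive so far.
supplier_alert = is_destabilizing([0.1, 0.2, 0.15], after["SupplierY"])
```

In this toy example the negative news about Samsung pushes the assumed supplier's score far enough from its recent average to raise a warning, although the supplier was never mentioned in the news item itself.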