Howto: Splitting Files With Standard Python Scripts
Ready-Made Data Sets That Blow Past the Limits
I am frequently handed raw data for analysis that, once uncompressed, easily comes to half a gigabyte or more per file. From about one gigabyte upwards, desktop statistics tools start to struggle. Most of them do offer options to load only a subset of the columns, or only the first 10,000 rows, and so on.
But what do you do when you want a random sample of the data? You should never rely on the file being in random order: the database export process may already have introduced systematic ordering effects. It may also be the case that you only want to analyse a tenth of some grouping, such as the purchases made by every tenth customer. For that, the complete file has to be read, since otherwise you cannot guarantee that every purchase of the selected customers is taken into account.
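Here is a minimal sketch of that grouped sampling, using only the Python standard library. The file name purchases.csv, the column name customer_id, and the helper keep() are assumptions for illustration; substitute whatever your export actually contains.

```python
import csv
import zlib

SRC = "purchases.csv"          # hypothetical input file
DST = "purchases_sample.csv"   # output containing the sampled rows
KEY = "customer_id"            # hypothetical grouping column
BUCKETS = 10                   # keep roughly one tenth of the customers

def keep(customer_id: str) -> bool:
    # Deterministic hash: the same customer always lands in the same
    # bucket, so all of their purchases are kept or dropped together.
    # (Python's built-in hash() is randomized per run for strings,
    # so a stable checksum like crc32 is used instead.)
    return zlib.crc32(customer_id.encode("utf-8")) % BUCKETS == 0

with open(SRC, newline="", encoding="utf-8") as src, \
     open(DST, "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:          # streams the file row by row
        if keep(row[KEY]):
            writer.writerow(row)
```

Because the file is streamed row by row, memory use stays constant no matter how large the input is. Note the difference from a plain row-level random sample (e.g. testing `random.random() < 0.1` inside the loop): that would scatter a single customer's purchases across kept and dropped rows, whereas hashing the customer ID keeps each customer's history intact.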