R Tips and Tricks - Part 1

R is the Open-Source All-rounder with a Steep Learning Curve

Approximately three years ago, I switched from a commercial statistics solution (similar to SPSS) to R. I can now say with conviction that I don't need another tool for advanced analytics. Especially in combination with the IDE RStudio, the software has reached a level of maturity that allows it to be used in large data science projects without any concerns.

However, one should not delude oneself that it is enough to install R and get started immediately. The learning curve is comparatively steep, among other reasons because the variety of packages means there are multiple ways to do the same thing. During my analyses I was frequently annoyed when a trivial step suddenly tripped me up and I had to research how to solve the problem in R before I could continue. Therefore, in this introduction (hopefully with many more parts to follow), I would like to present some tips and tricks that I would have appreciated knowing when I started.

Column Selection in R

Seemingly different notations

Let's start with something trivial: the selection of columns in data frames. Certainly this is dealt with in R's introductory material, but quite honestly, I wasn't always convinced by their explanation. R has a total of three distinct syntaxes that let you select a column:

data['columnName']
data[['columnName']]
data$columnName

This always confused me in the beginning. The syntax data$columnName is certainly the most convenient, because you don't have to handle annoying brackets and quotation marks. Even more important is the auto-complete feature included with RStudio, which is invoked by pressing CTRL+SPACE: it only works with the $ syntax.

I constantly find myself in the situation where I have the name of a column available as a string in a variable, e.g. when I let a loop run over a list of variables. In the past, as a beginner's mistake, I chose the syntax data[columnNameString] instead of data[[columnNameString]], just because I was used to it from other programming languages. Sometimes it works perfectly, but most of the time an error is returned.

For an exact column name, the syntax with double brackets returns the same result as the syntax with $: the column itself, as a vector. (One difference is that $ also accepts an unambiguous prefix of a column name, whereas double brackets require the exact name by default.) The syntax with single brackets, however, is not really a way to extract a column from a data frame. In tutorials and in a book, I found the explanation that this syntax returns a new data frame containing the specified columns. That is correct, with the qualification that the object does not have to be a data frame. It can also be another data container that supports this syntax.
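
The loop situation described above can be sketched as follows. The data frame df and its columns age and height are made up for illustration; only the bracket behaviour is the point.

```r
# Hypothetical example data; names are for illustration only
df <- data.frame(age = c(23, 35, 41), height = c(170, 182, 165))

column_means <- numeric(0)
for (column_name in c('age', 'height')) {
  # [[ ]] returns the column itself (a numeric vector), so mean() works
  column_means[column_name] <- mean(df[[column_name]])
}
print(column_means)

# With single brackets the result is a one-column data frame.
# mean() cannot handle that: it warns and returns NA.
single_bracket_result <- suppressWarnings(mean(df['age']))
print(single_bracket_result)
```

Whether single brackets fail loudly, fail quietly (as here) or happen to work depends on the function that receives the one-column data frame, which may explain why the mistake is easy to overlook.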
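
A small sketch that makes the difference between the three notations visible. The data frame df is a made-up example; class() and identical() are base R functions.

```r
# Hypothetical example data; names are for illustration only
df <- data.frame(age = c(23, 35, 41), height = c(170, 182, 165))

by_single <- df['age']    # a data frame with one column
by_double <- df[['age']]  # the column itself, a numeric vector
by_dollar <- df$age       # likewise the column itself

print(class(by_single))   # "data.frame"
print(class(by_double))   # "numeric"
print(identical(by_double, by_dollar))  # TRUE

# $ accepts an unambiguous prefix of a column name,
# [[ ]] requires the exact name by default
prefix_dollar <- df$hei       # matches the column 'height'
prefix_double <- df[['hei']]  # NULL, no column with exactly this name
```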

Background of the selection with single brackets

But what is this for? Why should I use data['columnName'] to extract a copy of the original data frame with only one column? In this form, the notation is indeed of little use; it is really just a special case of a more general application.

R offers the possibility to select a subset of the columns of a data frame with the above syntax, e.g. data[c('column_A', 'column_B', 'column_C')], or else just using the indices, e.g. data[c(1,2,89,99)] or data[1:10].

That's the way R works: there are no scalars, i.e. no variables in the sense of a single typed value. Instead, there are only vectors, whose elements all have the same type. If you type "columnName<-'column_a'" you generate a character vector with a length of 1, which is equivalent to "columnName<-c('column_a')".
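
As an illustration, here is the same subset selected once by name and once by index. The data frame df and its contents are made up for this sketch.

```r
# Hypothetical example data; names are for illustration only
df <- data.frame(
  column_A = 1:3,
  column_B = c('x', 'y', 'z'),
  column_C = c(TRUE, FALSE, TRUE)
)

by_name  <- df[c('column_A', 'column_C')]  # select by column names
by_index <- df[c(1, 3)]                    # select the same columns by position
by_range <- df[1:2]                        # first two columns

print(identical(by_name, by_index))  # TRUE
print(names(by_range))               # "column_A" "column_B"
```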

dummy <- 'dummyText'
dummy2 <- c('dummyText')
identical(dummy, dummy2)  ## Result: TRUE

Thus, the result of data['columnName'] becomes clearer since it is equal to data[c('columnName')].

Cross tables correctly labeled

In the past, in order to use the auto-complete in RStudio, I formulated cross-tabulations like this: table(dataset$columnA, dataset$columnB). I was annoyed that, when looking at the output, I always had to think through which column is shown in the rows and which in the columns, because the labels were missing. With the knowledge from above, it is actually clear why. The data from columnA and columnB are extracted from the data set as plain vectors, which themselves carry no name.

But if I write the cross-tabulation like this: table(dataset[c('columnA', 'columnB')]) then I do not pass two anonymous vectors, but a data frame with two columns that keep their names. In this case, I do get the labels.
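
The effect can be checked directly on the dimension names of the resulting tables. The sketch below uses a made-up data set; table(), dimnames() and names() are base R.

```r
# Hypothetical example data; values are for illustration only
dataset <- data.frame(
  columnA = c('yes', 'no', 'yes', 'no', 'yes'),
  columnB = c('m', 'f', 'f', 'm', 'm')
)

# Two anonymous vectors: the dimensions of the table have no labels
unlabeled <- table(dataset$columnA, dataset$columnB)
print(names(dimnames(unlabeled)))  # "" ""

# A two-column data frame: the column names become the labels
labeled <- table(dataset[c('columnA', 'columnB')])
print(names(dimnames(labeled)))    # "columnA" "columnB"
print(labeled)
```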

Best practice for column filters

In conclusion, I would like to give a best practice recommendation in this context. It is best to immediately forget about the syntaxes that refer directly to indices (such as data[1:10]), since they are prone to errors and can be unreadable. Instead, I always define a vector of column names, e.g. predictors_relevant <- c('Predictor01','Predictor03','Predictor05','Predictor09'), and then select only using this vector, e.g. data[predictors_relevant]. This keeps the code flexible, and thanks to the descriptive names I immediately know what I am actually selecting.

Similarly, I don't believe in using the minus syntax to exclude columns. Instead, I use setdiff, so that all of the columns concerned are readable in plain text. For example, if I would like to use all predictors except 'Predictor03': data[setdiff(predictors_relevant, c('Predictor03'))].
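
Put together, the recommendation might look like the following sketch. The data frame and its values are invented for illustration; the predictor names are the ones from the example above.

```r
# Hypothetical example data; values are for illustration only
data <- data.frame(
  Predictor01 = c(1.2, 3.4),
  Predictor02 = c(0, 1),
  Predictor03 = c(5, 6),
  Predictor05 = c(7, 8),
  Predictor09 = c(9, 10)
)

# Name the selection once, then select only via this vector
predictors_relevant <- c('Predictor01', 'Predictor03', 'Predictor05', 'Predictor09')
selected <- data[predictors_relevant]

# Exclude a column by name with setdiff instead of the minus syntax
predictors_reduced <- setdiff(predictors_relevant, c('Predictor03'))
reduced <- data[predictors_reduced]

print(names(selected))
print(names(reduced))
```

setdiff keeps the remaining names in their original order, so the column order of the reduced selection stays predictable.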