
Clear visualization of regularization paths with HTML widgets

 

A clear alternative to spaghetti tangle visualization

The glmnet package, which implements the elastic net regression of Hastie and Tibshirani, is one of the workhorses of data science, at least for R users. The package has some built-in visualization capabilities; in particular, regularization paths can be displayed graphically. This visualization has its pitfalls, however. If the model contains many variables, the result looks like a garish tangle of spaghetti and does little to help in understanding the model's behavior. Unfortunately, the elastic net is particularly popular precisely when there are many, or frightfully many, variables.

A better visualization is therefore needed. A large number of variables imposes natural limits on any static plot that tries to show all the information at once: at some point, the display area simply fills up. One solution is visualization with HTML and JavaScript. Mouse input can then highlight an individual regularization path and reveal its course through the spaghetti tangle. At the same time, additional information can be displayed, such as the variable name in our case. In this way, the spaghetti tangle becomes decipherable.

Generation of HTML widgets with familiar syntax

One of the drawbacks of R is that, although there is a package for everything, most packages come with their own very special command variants. The syntax is far removed from the standardization found in Python (despite the diversity of packages there). Any tentative steps toward standardization are therefore welcome. One such case is visualization in the universe of ggplot2, which is known to almost everyone who works with R. It is the most common R package for data visualization (alongside the built-in graphics capabilities of R). Why speak of a "universe"? For some time now, ggplot2 has offered an interface that allows other packages to extend it with additional functionality. An overview of the already quite extensive collection of extensions is available at www.ggplot2-exts.org.

One of the extensions is called "ggiraph" and offers HTML/JavaScript variants of the familiar standard plots. Compared to other extensions that allow similar results, ggiraph has the advantage that no completely new syntax has to be learned: existing ggplot2 code needs only slight changes in order to produce HTML widgets. For me, that is a decisive argument in favor of ggiraph. However, I must admit that I have not tried alternatives such as metricsgraphics, plotly, highcharter or rbokeh. I would be curious to hear about experiences with them.

Replacement of glmnet visualization with HTML widgets

To solve our specific visualization problem, we must build an alternative to the plot function of the glmnet package that generates an HTML widget instead of a standard plot. To obtain a final result comparable to the original, it makes sense to follow the corresponding plot routine of the glmnet package and translate it into the syntax of ggiraph. Unfortunately, glmnet uses the base graphics of R instead of ggplot2 for visualization; otherwise this task would be somewhat easier. It is still easy to accomplish, however, and leaves time to integrate a few small additional features that make the plot easier to read. Firstly, a parameter named ylimits restricts the range of values displayed on the y-axis. Secondly, a parameter named max_abs_coeff_range restricts the variables to be displayed: only variables whose maximum absolute coefficient along the path lies strictly between max_abs_coeff_range[1] and max_abs_coeff_range[2] are plotted. This can be used to break down excessively intricate plots into multiple parts: one for variables that reach large coefficients along their regularization path, and one for those whose coefficients stay small. This prevents the regularization paths of the latter variables from hugging the x-axis so closely that they become indistinguishable. Finally, the parameters title and ylab set the title of the graph and the label of the y-axis.
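
The selection rule behind max_abs_coeff_range can be illustrated in isolation. The following is only a minimal base-R sketch on a made-up coefficient matrix; the helper name filter_by_max_abs is hypothetical and is not part of glmnet or of the function shown in this article.

```r
# Toy coefficient matrix (made-up values):
# rows = variables, columns = steps along the regularization path
beta <- matrix(c(0,  0.02,  0.05,
                 0, -0.80, -1.50,
                 0,  0.30,  4.00),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("small", "medium", "large"), NULL))

# Hypothetical helper: keep the rows whose largest absolute coefficient
# lies strictly between range[1] and range[2]
filter_by_max_abs <- function(beta, range) {
  max_abs <- apply(abs(beta), 1, max)
  beta[max_abs > range[1] & max_abs < range[2], , drop = FALSE]
}

low  <- filter_by_max_abs(beta, c(0, 0.1))    # paths that stay close to zero
high <- filter_by_max_abs(beta, c(0.1, Inf))  # paths that reach larger values

print(rownames(low))
print(rownames(high))
```

Because both comparisons are strict, a variable whose maximum absolute coefficient falls exactly on a boundary appears in neither part, so the boundaries should be chosen with that in mind.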

How is the code structured?

The reusable portion of the code is the function plot_regupath. It is very similar to the plot.glmnet function from the glmnet package and to the corresponding internal glmnet function named plotCoef. In addition to the aforementioned extra parameters, the function takes the same parameters x (the fitted model) and xvar (the display type) as the plot function from the glmnet package. A more detailed explanation of these parameters can be found in the glmnet documentation.

library(tidyr)
library(ggplot2)
library(ggiraph)
library(htmlwidgets)
library(data.table)
library(glmnet)
plot_regupath <- function (x, xvar = c("norm", "lambda", "dev"), ylab = "Coefficient", ylimits=NULL, title=NULL, max_abs_coeff_range=NULL) 
{
    beta <- x$beta
    lambda <- x$lambda
    df <- x$df
    dev <- x$dev.ratio
    
    which = nonzeroCoef(beta)
    nwhich = length(which)
    switch(nwhich + 1, `0` = {
        warning("No plot produced since all coefficients zero")
        return()
    }, `1` = warning("1 or less nonzero coefficients; glmnet plot is not meaningful"))
    beta = as.matrix(beta[which, , drop = FALSE])
    xvar = match.arg(xvar)
    switch(xvar, norm = {
        index = apply(abs(beta), 2, sum)
        iname = "L1 Norm"
        approx.f = 1
    }, lambda = {
        index = log(lambda)
        iname = "Log Lambda"
        approx.f = 0
    }, dev = {
        index = dev
        iname = "Fraction Deviance Explained"
        approx.f = 1
    })
    
    data_for_plot <- tidyr::gather(data.frame(x_values=index, t(beta)), 
                                   key="variable", value="coefficient", 2:(nrow(beta)+1))
    
    data_for_plot <- as.data.table(data_for_plot)
    data_for_plot[, max_abs_coeff:=max(abs(coefficient)), by=variable]
    
    if (!is.null(max_abs_coeff_range)) {
        data_for_plot <- data_for_plot[max_abs_coeff > max_abs_coeff_range[1] & max_abs_coeff < max_abs_coeff_range[2]]
    }
    
    plot <- ggplot(data_for_plot, aes(x=x_values, y=coefficient, colour=variable)) + 
        # use interactive HTML-Version of a lineplot, with transparent lines, and a tooltip displaying the variable:
        geom_line_interactive(alpha=0.2, aes(data_id=variable, tooltip=variable)) +  
        guides(colour="none") +             # remove the legend
        xlab(label=iname) +                 # set x-axis label
        ylab(label=ylab) +                  # set y-axis label
        theme_bw()                          # change overall appearance to a very reduced one
    
    if (!is.null(ylimits)) {
        plot <- plot + ylim(ylimits)
    }
    
    if (!is.null(title)) {
        plot <- plot + ggtitle(label=title)
    }
    
    
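    # girafe() is the current ggiraph entry point and replaces the older ggiraph() call;
    # tooltip, hover and zoom settings are passed via options
    result <- ggiraph::girafe(ggobj=plot, height_svg=6, width_svg=12,
                              options=list(opts_tooltip(opacity=0.5, offy=10),
                                           opts_hover(css="stroke-width:3;stroke-opacity:1"),
                                           opts_zoom(max=100)))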
    
    return(result)
}

c_working_directory <- "" # folder containing the input file; must be filled in before running
c_input_file_name <- "FullData.csv"
c_output_file_name <- "Regularisation_path.html"

c_predictors_to_omit <- c("Rating", "Name", "Nationality", "Club", "Contract_Expiry", "Birth_Date", "Height", "Weight", 
                          "National_Kit", "Club_Position", "Club_Kit", "Club_Joining", "Preffered_Position")

set.seed(4711)
setwd(c_working_directory)

dataset <- fread(c_input_file_name, encoding="UTF-8", sep=",", dec=".")

# exclude irrelevant predictors (and some relevant ones, to keep the dataset small)
relevant_predictors <- names(dataset)
relevant_predictors <- relevant_predictors[!relevant_predictors %in% c_predictors_to_omit] 

dataset[, above_average_ind:=as.integer(Rating>mean(Rating))] # add a 0/1-indicator whether a player is above average
dataset[, height_numeric:=as.integer(gsub("cm", "", Height))]
dataset[, weight_numeric:=as.integer(gsub("kg", "", Weight))]


prediction_formula <- as.formula(paste0("above_average_ind ~ ", paste(relevant_predictors, collapse=" + ")))

train_set_size <- floor(0.8 * nrow(dataset))
data_train <- dataset[sample(.N, train_set_size),]

model_matrix <- model.matrix(prediction_formula, data=data_train)
target <- as.factor(data_train[, above_average_ind])

glmnet_fit <- glmnet(model_matrix, target, alpha=1, family="binomial", standardize=TRUE)

html_plot <- plot_regupath(glmnet_fit, xvar="norm")
saveWidget(html_plot, c_output_file_name) # save plot to file, result is best viewed in Browser
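
For orientation, the x-axis values that the function computes for xvar="norm" and xvar="lambda" can be reproduced by hand. This is only a small base-R sketch on made-up numbers; the third option, xvar="dev", uses the dev.ratio values stored in the fitted model and is therefore not reproduced here.

```r
# Toy coefficient matrix (made-up values):
# rows = variables, columns = steps along the regularization path
beta <- matrix(c(0, 0.5,  1.0,
                 0, 0.0, -2.0),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("var_a", "var_b"), NULL))
lambda <- c(1, 0.1, 0.01)        # one (made-up) penalty value per step

l1_norm    <- colSums(abs(beta)) # x-axis for xvar = "norm"
log_lambda <- log(lambda)        # x-axis for xvar = "lambda"

print(l1_norm)
print(log_lambda)
```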

What kind of football data are used in the example?

The rest of the code (everything except the plot_regupath function) demonstrates the function with an example. The data come from the video game FIFA 2017 by Electronic Arts and can be downloaded from Kaggle:

https://www.kaggle.com/artimous/complete-fifa-2017-player-dataset-global/version/2

Various variables are available for every player, a selection of which is used in our sample code to predict a binary target variable using logistic regression. We constructed the target variable from the data by assigning 1 to a player if the "Rating" variable is above average, and 0 otherwise. In other words, we try to identify the especially good players. We keep data processing and feature engineering to a bare minimum in order to demonstrate the visualization with as little code as possible. Finally, the graph is saved as an HTML file with saveWidget. It can then be viewed in a browser (also see the illustration). Don't forget to move the mouse over it!
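
The construction of the target variable can be illustrated on a tiny made-up table. This is only a sketch in base R: the player names and ratings below are invented, and only the column name Rating and the indicator name above_average_ind are taken from the sample code.

```r
# Made-up mini dataset; the real data contain many more columns
players <- data.frame(
  Name   = c("A", "B", "C", "D"),
  Rating = c(60, 70, 80, 90)
)

# 1 if the rating is above the mean rating, 0 otherwise
players$above_average_ind <- as.integer(players$Rating > mean(players$Rating))

print(players)
```

With a mean rating of 75, the two players rated 80 and 90 receive the indicator 1 and the other two receive 0.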
