Boosting for the Naive Bayes Classifier

Neuroscience and machine learning overlap in many areas. One of them is the idea of combining several learning episodes, each only moderately successful, into a merged, stronger model for a particular task. In machine learning, this process is called "boosting". Solutions of this kind are of great interest, particularly in the IT industry, so this article gives a short introduction to the basic ideas and applies them to the naïve Bayes classifier in R.

Basic Ideas

Especially in the 1990s, the idea of combining models from several learning episodes for a particular kind of task was discussed intensively in neuroscience and machine learning (see References). I will first present these learning processes briefly and intuitively:

1. A task is defined that expects input-specific answers from the learner. For training, sample inputs are given to the learner together with their expected answers (in machine learning this is typically a classification or regression task).
2. In a first learning episode, the task is learned with only small, "weak" success. This means that the number of errors is not yet small enough to be satisfied with the learned model, so further learning episodes are required.
3. In each subsequent episode, the learner concentrates on the errors made in the previous learning episode. Each episode thus produces a newly learned model that, on the whole, does not differ substantially from the previous ones. In this way, learning may become more successful within the episode thanks to the earlier experience with the data. The learning effect, however, is still not at the level where the task is fully mastered. In effect, several weak models are collected over the episodes.
4. After all learning episodes, the learner has the complete collected experience and can combine all weak models into one boosted final model. Ideally, this combination takes the success rate of the individual weak models into account. That means: the weakest among the weak count less in the final decision of the combined model.
5. The more learning episodes (weak models) are performed and combined, the further the learner should be able to reduce the total number of errors on the task, almost "at will". This is a direct consequence of reducing the total variance.

The original algorithm that implements the above logic was designed by Freund and Schapire (1996) and is called "AdaBoost" (short for "adaptive boosting").
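One round of the reweighting logic described above can be sketched in a few lines of base R. This is only a minimal illustration on made-up toy values, not the full algorithm; the names `w`, `miss`, `err`, `beta` and `vote_weight` are assumptions chosen for this example.

```r
# Toy illustration of one AdaBoost.M1 round (all values are made up).
w    <- rep(1 / 5, 5)          # uniform start weights for 5 observations
miss <- c(0, 0, 1, 0, 0)       # 1 = misclassified by the weak model, 0 = correct

err  <- sum(w * miss)          # weighted error of the weak model
beta <- err / (1 - err)        # beta < 1 whenever err < 0.5

# Shrink the weights of correctly classified observations only, then renormalize
w_new <- ifelse(miss == 1, w, w * beta)
w_new <- w_new / sum(w_new)

vote_weight <- log(1 / beta)   # weight of this model in the final vote

print(round(w_new, 3))
```

After the update, the single misclassified observation carries half of the total weight, so the next learning episode is forced to concentrate on it.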

Boosted Naive Bayes Classifier

What prerequisites must a model meet so that it can be "boosted"? A "weak" learner has to be used for the learning episodes. That means the criterion of high variance has to be fulfilled (e.g. decision trees with very low depth), because boosting models with low variance is ineffective (see References). Bayesian learning models were highly regarded in the 1990s, among other things for their elegant mathematical formulation and experimental consistency. Today, naïve Bayes classifiers are well known and transparent as production solutions, e.g. for detecting email spam. The naïve Bayes classifier thus offers itself as a "weak" model to be boosted with the AdaBoost algorithm.

A short internet search turns up R packages for boosting e.g. decision trees, but not for the naïve Bayes classifier. The literature contains several variants of the AdaBoost algorithm; this time I decided to implement the original "AdaBoost.M1" by Freund and Schapire with the naïve Bayes classifier in R. Thanks to the R syntax and the R packages, this is really simple, and the idea of comparing the simple classifier with its boosted version quickly becomes exciting.

And here is my comparison...

Problem: classification of two classes {0, 1} with 52 features.

Data source: I selected 1,000 observations of 52 predictors without missing values, plus the target variable, from the "Weight Lifting Exercises Dataset". The data are publicly available and may be freely selected, processed and used both academically and commercially under their "CC BY-SA" license, as long as works based on them are published under the terms of "CC BY-SA".

Data preparation: the data were split 70%/30% into training and test sets from the start, and the same seed was always used.
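A reproducible 70%/30% split of this kind can be sketched in base R as follows. The data frame `toy` is a hypothetical stand-in for the real data, and drawing the indices with `sample()` is an assumption made for brevity; the full code below instead sorts the rows by a uniform random value, which serves the same purpose.

```r
set.seed(79)                                   # fixed seed for reproducibility
n_obs <- 1000                                  # number of observations, as in the text

# Hypothetical stand-in data: one predictor and a binary target
toy <- data.frame(x = rnorm(n_obs),
                  classes = rbinom(n_obs, 1, 0.5))

n_train   <- floor(0.7 * n_obs)                # size of the training set
idx_train <- sample(seq_len(n_obs), n_train)   # random row indices, without replacement

dat_train <- toy[idx_train, ]
dat_test  <- toy[-idx_train, ]                 # all remaining rows

print(c(train = nrow(dat_train), test = nrow(dat_test)))
```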

Comparison criterion: "Accuracy" from the R function "confusionMatrix()".
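To make the criterion concrete, accuracy and its 95% confidence interval can be recomputed by hand in base R. The sketch below assumes 300 test observations (30% of 1,000) and 195 correct predictions, which is what an accuracy of 0.65 would correspond to; both counts are derived here for illustration and are not stated in the text. It relies on `binom.test()`, which the caret documentation names as the basis of the interval reported by `confusionMatrix()`.

```r
# Assumed counts, derived for illustration (not given explicitly in the article)
n_test    <- 300                     # 30% of 1,000 observations
n_correct <- 195                     # would correspond to an accuracy of 0.65

acc <- n_correct / n_test            # share of correctly classified test observations

# Exact binomial 95% confidence interval for the accuracy
ci <- binom.test(n_correct, n_test, conf.level = 0.95)$conf.int

print(acc)
print(round(as.numeric(ci), 4))
```

The printed interval should be comparable to the one reported for the plain classifier in the results.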

Results:

1. Normal naive Bayes classifier:

nLearners = 1; Accuracy = 0.6500; 95% CI = (0.5931, 0.7039)

2. Boosted naive Bayes classifier:

nLearners = 3; Accuracy = 0.6900; 95% CI = (0.6343, 0.7419)
nLearners = 6; Accuracy = 0.7067; 95% CI = (0.6516, 0.7576)
nLearners = 12; Accuracy = 0.7133; 95% CI = (0.6586, 0.7638)
nLearners = 24; Accuracy = 0.7167; 95% CI = (0.6620, 0.7670)

The result on the test data was not unexpected, but nevertheless very pleasing. The good thing is that anyone can test my R code with their own data. Thanks to the R package "caret", even the classification method can be replaced by another one and tested very quickly.

And below is my R code…

<pre>
#----- Initialization -----#

# Required packages
library(ggplot2)
library(lattice)
library(MASS)
library(klaR)
library(caret)

# Clear the old workspace
rm(list = ls())
gc()

# Working directory
setwd("C:/Users/UserName/Documents/ProjectFolder/")

# Parameters
set.seed(79)
dataFile <- "bnb_sample_data.csv"
rootModName <- "mod_nb_class_"
extModName <- ".rda"
pDatSelTrain <- 0.7
trainMethod <- "cv"
kFolds <- 5
laplaceCorr <- 1
useKernEstim <- TRUE
plotVarsHists <- FALSE
nLearners <- 24

# Delete old models
file.remove(list.files(pattern = rootModName))

#----- Data preparation -----#

# Load the data from the CSV file
datFull <- read.csv(file = dataFile, header = TRUE, sep = ",",
                    na.strings = c("NA", ""))

# Data properties (the target variable is the last column)
nDat <- length(datFull[[1]])
nFeat <- length(datFull)
iPred <- nFeat

# Data exploration: histograms of all variables
if (plotVarsHists == TRUE) {
  graphics.off()
  for (i in 1:nFeat) {
    hist(datFull[, i],
         xlab = names(datFull)[i],
         main = paste("Histogram ", i, sep = ""))
  }
}

# Set the size of the training data
iDatSel <- floor(pDatSelTrain*nDat)

# Split into training and test data via a random sort order
datFull$unif_rnd_val <- runif(nDat, min = 0, max = 1)
datFullSort <- datFull[order(datFull[, (nFeat + 1)], decreasing = TRUE), ]
datTrain <- datFullSort[1:iDatSel, 1:nFeat]
datTest <- datFullSort[(iDatSel + 1):nDat, 1:nFeat]

# Define the target variable as classes
datTrain[, iPred] <- factor(datTrain[, iPred])
datTest[, iPred] <- factor(datTest[, iPred])

# Data properties
nDatTrain <- dim(datTrain)[1]
nDatTest <- dim(datTest)[1]

# Summary of the training and test data sets
#summary(datTrain)
#summary(datTest)

#----- Model training -----#

# Model tuning
# (Note: newer caret versions may additionally expect an "adjust" column here.)
modSel <- trainControl(method = trainMethod, number = kFolds)
modTun <- data.frame(.fL = laplaceCorr, .usekernel = useKernEstim)

# Initialize the probabilities (weights) of the observations
pObs <- rep(1.0/nDatTrain, nDatTrain)
Beta <- c()

# Train the models with AdaBoost.M1
for (j in 1:nLearners) {

  # Draw a sample with replacement according to pObs
  indWhgtXj <- sample(seq_len(nDatTrain), nDatTrain, replace = TRUE, prob = pObs)
  wghtXj <- datTrain[indWhgtXj, ]

  # Train the model with cross-validation
  modNBClass <- train(classes ~ .,
                      data = wghtXj,
                      method = "nb",
                      tuneGrid = modTun,
                      trControl = modSel)
  #print(modNBClass)
  #print(modNBClass$finalModel)

  # Compute the model predictions on the full training data
  predNBClass <- predict(modNBClass, datTrain)

  # Compute the error rate: Ejt is 1 for a wrong prediction, 0 for a correct one
  Ejt <- abs(as.numeric(as.character(predNBClass)) -
             as.numeric(as.character(datTrain$classes)))
  rm(predNBClass)
  Ej <- sum(pObs*Ejt)

  # If the error rate is > 1/2, stop and keep only the models learned so far
  if (Ej > 0.5) {
    nLearners <- j - 1
    print(paste("Adaptive learning stopped. Error = ", Ej, "; number of learners = ",
                nLearners, sep = ""))
    break
  } else {
    # Otherwise update the observation weights and save the model

    # Compute the learning rate
    Beta[j] <- Ej/(1.0 - Ej)

    # Reduce the weights only for correct predictions
    pObs <- (1.0 - Ejt)*Beta[j]*pObs + Ejt*pObs

    # Normalize the probabilities of the observations
    pObs <- pObs/sum(pObs)

    # Save the trained model to an R data file
    thisModFile <- paste(rootModName, j, extModName, sep = "")
    save(modNBClass, file = thisModFile)
    rm(modNBClass)
  }
}

#----- Model testing -----#

# Test the models with AdaBoost.M1
wghtPredNBClass <- rep(0, nDatTest)
for (j in seq_len(nLearners)) {
  # Load the learned model again
  thisModFile <- paste(rootModName, j, extModName, sep = "")
  load(thisModFile)

  # Prediction of the learned model
  predNBClass <- predict(modNBClass, datTest)
  rm(modNBClass)

  # Combine the models
  if (nLearners >= 2) {
    # Prediction weighted by the vote weight of the learned model
    wghtPredNBClass <- wghtPredNBClass +
                       log(1.0/Beta[j])*as.numeric(as.character(predNBClass))
  } else {
    # Non-boosted predictions
    wghtPredNBClass <- as.numeric(as.character(predNBClass))
  }
  rm(predNBClass)
}

# Normalize the predictions if there are several models
if (nLearners >= 2) {
  wghtPredNBClass <- wghtPredNBClass/sum(log(1.0/Beta))
}

# Convert back into classes
totPredNBClass <- as.factor(as.integer(wghtPredNBClass > 0.5))

# Show the confusion matrix
print(nLearners)
confusionMatrix(totPredNBClass, datTest$classes, positive = "1")

#----- End of the routine! -----#</pre>
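The final combination step in the code above is a weighted vote, which is easy to follow on a toy example in base R. All values below are made up, and the names `beta`, `alpha`, `preds`, `score` and `final` are assumptions chosen for this sketch; it mirrors the normalization and the 0.5 threshold used in the testing section.

```r
# Toy weighted vote of three weak models (all values are made up).
beta  <- c(0.25, 0.5, 0.8)            # smaller beta = more accurate model
alpha <- log(1 / beta)                # vote weight of each model

# Rows = observations, columns = models; entries are predicted classes 0/1
preds <- matrix(c(1, 0, 0,
                  0, 1, 1),
                nrow = 2, byrow = TRUE)

# Normalized weighted vote in [0, 1], then threshold at 0.5
score <- as.vector(preds %*% alpha) / sum(alpha)
final <- as.integer(score > 0.5)

print(round(score, 3))
print(final)
```

In both toy observations the most accurate model outvotes the two weaker ones, which is exactly the idea that the weakest among the weak count less in the final decision.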

References

Boosting and Naïve Bayesian Learning. Elkan, Technical Report UC San Diego, 1997.
Interpretable Boosted Naïve Bayes Classification. Ridgeway et al., KDD-98 Proceedings, 1998.
Maschinelles Lernen. Alpaydin, Oldenbourg Verlag, 2008.
Implementation of Naive Bayesian Classifier and Ada-Boost Algorithm Using Maize Expert System. Korada et al., IJIST, 2012.