Synopsis

This project applies a machine learning technique to make predictions in the field of Human Activity Recognition.

The data from the project is available from:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data comes from a study as follows:

Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

More details can be found in the paper “Qualitative Activity Recognition of Weight Lifting Exercises”, published by the authors of the study:

http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201

Create Training and Cross-Validation Sets

First we set the random seed and load the necessary library and data.

library(caret)
set.seed(3243)
#Treat literal "NA", empty strings and "#DIV/0!" entries as missing values.
training <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testing <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))

Then we create the sets, using 70% of the data for training and holding out the rest for cross-validation.

inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)
#Take the hold-out rows before overwriting the training set.
crossValidation <- training[-inTrain,]
training <- training[inTrain,]

Feature selection

A machine learning model “learns” from a set of features (or variables) that produce an outcome. In order to develop a model that is both efficient and effective, the right set of features must be determined. Too many features may carry a large computational cost without contributing to the overall performance of the model. Too few, and the model may not be able to find a good fit to predict the outcomes on new observations.

Exploring the datasets, it is possible to see that many variables in the test set contain only NA values.
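
As a quick check (a minimal sketch whose output is not reproduced here), we can count how many test-set columns consist entirely of NA values:

#is.na() on a data frame returns a logical matrix, so colSums() gives the NA count per column.
naCounts <- colSums(is.na(testing))
table(naCounts == nrow(testing))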

The idea is to keep only the variables that actually contain data in the test set, in order to build the simplest possible model while retaining the meaningful variables.

A variable with no values would have zero impact on predictions made on the test set. We will identify those variables in the test set and then remove the same variables from the training set.

Some variables do not seem to be meaningful predictors, such as the row index, user name, timestamps and the new-window flag.

The timestamps are removed because the actual time the user performed the actions is of no importance for prediction, since a new user may perform the action at any time in the future. The underlying structure of the time series is still preserved in the data generated by the sensor readings.
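
Before dropping them, we can list the metadata columns in question (a small sketch; the first six columns of the raw data are assumed to be the row index, user name, timestamps and new-window flag):

#Show the names of the metadata columns that will be removed.
names(training)[1:6]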

#We remove the row index, the user name, the timestamps and the new-window variable from all sets.
training <- training[,-c(1:6)]
testing <- testing[,-c(1:6)]
crossValidation <- crossValidation[,-c(1:6)]
#This function removes every variable whose values are all NA, i.e. the count of NA values equals the number of rows.

removeNa <- function(dataframe){
  varIndex <- c()
  for (i in seq_along(dataframe)){
    #Flag the column if every value is NA.
    if(sum(is.na(dataframe[,i])) == nrow(dataframe)){
      varIndex <- c(varIndex, i)
    }
  }
  #Guard against the case where no column is completely NA.
  if(length(varIndex) > 0) dataframe <- dataframe[-varIndex]
  dataframe
}
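
The same filtering can also be written without an explicit loop; the vectorized helper below is an equivalent alternative sketch, not the function used in the rest of the analysis.

#Keep only the columns that contain at least one non-NA value.
removeNaVectorized <- function(dataframe){
  dataframe[, colSums(is.na(dataframe)) < nrow(dataframe), drop = FALSE]
}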

Now we apply the function to the test set to remove the variables.

testing <- removeNa(testing)
dim(testing)
## [1] 20 54

The test set is now reduced to 54 variables.

The test set contains a variable “problem_id” that is not present in the training set. We keep the remaining 53 variables in the training and cross-validation sets, together with the “classe” variable, which is our response variable.

#take all variable names left in the test set, except for "problem_id".
varNames <- names(testing[1:53])

#keep only these variables in the training and cross-validation sets, plus the response variable.
training <- training[,c(varNames, "classe")]
crossValidation <- crossValidation[,c(varNames, "classe")]
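
As a sanity check (a short sketch, not part of the original output), we can confirm that the sets now differ only in the response and identifier columns:

#Should return "classe" and "problem_id" respectively.
setdiff(names(training), names(testing))
setdiff(names(testing), names(training))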

Choosing and fitting a model

As described in the paper, the authors chose the random forest algorithm to perform the classification. I decided to use the same approach.

We set the training parameters so that the random forest is tuned with internal 3-fold cross-validation.

myTrControl <- trainControl(method = "cv", number = 3)
model <- train(classe ~ ., data = training, method = "rf", trControl = myTrControl)
print(model)
## Random Forest 
## 
## 13737 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 9157, 9159, 9158 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9915555  0.9893173  0.001421753  0.001798840
##   27    0.9954866  0.9942905  0.001203360  0.001522727
##   53    0.9924293  0.9904236  0.001119695  0.001415214
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
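
To see which sensor readings contribute most to the classification, caret's varImp() can be applied to the fitted model (a sketch; the resulting ranking is not reproduced here).

#Rank predictors by the random forest's variable importance and plot the top 20.
importance <- varImp(model)
plot(importance, top = 20)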

Cross-validation and out-of-sample error

Next we use the fitted model to predict on the cross-validation set and calculate the out-of-sample error, which is one minus the accuracy.

predictions <- predict(model, newdata = crossValidation)
confMatrix <- confusionMatrix(predictions, crossValidation$classe)
print(confMatrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1162    0    0    0    0
##          B    0  817    0    0    0
##          C    0    0  708    0    0
##          D    0    0    0  677    0
##          E    0    0    0    0  758
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9991, 1)
##     No Information Rate : 0.2819     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2819   0.1982   0.1718   0.1642   0.1839
## Detection Rate         0.2819   0.1982   0.1718   0.1642   0.1839
## Detection Prevalence   0.2819   0.1982   0.1718   0.1642   0.1839
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

Calculate the out-of-sample error.

1 - as.numeric(confMatrix$overall[1])
## [1] 0

The estimated out-of-sample error is 0: the model classifies every observation in the cross-validation set correctly.

Predictions on the Test set

predict(model, newdata = testing)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
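
To pair each predicted class with its test case (a minimal sketch using the “problem_id” column kept in the reduced test set), the results can be collected into a data frame:

#Combine the test-case identifiers with the predicted classes.
testPredictions <- predict(model, newdata = testing)
data.frame(problem_id = testing$problem_id, prediction = testPredictions)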