8 Regression*

This chapter introduces the regression problem and multiple linear regression. In addition, alternative models like regression trees and regularized regression are discussed.

Packages Used in this Chapter

pkgs <- c('lars')
  
pkgs_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_install)) install.packages(pkgs_install)

The packages used for this chapter are:

  • lars (Hastie and Efron 2022)

8.1 Introduction

Classification predicts one of a small set of discrete labels (e.g., yes or no). Regression is also a supervised learning problem, but the goal is to predict a continuous value given a set of predictors. We start with the popular linear regression and will later discuss alternative regression methods.

Linear regression models the value of a dependent variable \(y\) (also called the response) as a linear function of independent variables \(X_1, X_2, ..., X_p\) (also called regressors, predictors, exogenous variables or covariates). Given \(n\) observations the model is: \[y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i \]

where \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_p\) are the parameters learned from the data, and \(\epsilon_i\) is the error term (its estimates are called residuals).
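The parameters are typically estimated using ordinary least squares. In matrix notation, with \(X\) denoting the design matrix (including a column of ones for the intercept), the standard closed-form solution is \[\hat{\beta} = (X^\top X)^{-1} X^\top y\] This requires \(X^\top X\) to be invertible, which is why perfectly correlated predictors (see the multicollinearity assumption below) are a problem.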

Linear regression makes several assumptions:

  • Weak exogeneity: Predictor variables are assumed to be error free.
  • Linearity: There is a linear relationship between dependent and independent variables.
  • Homoscedasticity: The variance of the error (\(\epsilon\)) does not change (increase) with the predicted value.
  • Independence of errors: Errors between observations are uncorrelated.
  • No multicollinearity of predictors: Predictors cannot be perfectly correlated, otherwise the parameter vector is not identifiable. Note that highly correlated predictors lead to unstable results and should be avoided (a quick check is sketched below).
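Highly correlated predictors can be screened with a correlation matrix. A minimal sketch using the iris data that is loaded in the next section:

# pairwise correlations between the numeric variables
cor(iris[, 1:4])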

8.2 A First Linear Regression Model

Load and shuffle the data (the flowers are in order by species).

data(iris)
set.seed(2000) # make the sampling reproducible

x <- iris[sample(1:nrow(iris)),]
plot(x, col=x$Species)

Make the data a little messy and add a useless feature

# add random noise to the first three features
x[,1] <- x[,1] + rnorm(nrow(x))
x[,2] <- x[,2] + rnorm(nrow(x))
x[,3] <- x[,3] + rnorm(nrow(x))

# add a useless feature that is unrelated to the other variables
x <- cbind(x[,-5], 
           useless = mean(x[,1]) + rnorm(nrow(x)), 
           Species = x[,5])

plot(x, col=x$Species)
summary(x)
##   Sepal.Length   Sepal.Width     Petal.Length   
##  Min.   :2.68   Min.   :0.611   Min.   :-0.713  
##  1st Qu.:5.05   1st Qu.:2.306   1st Qu.: 1.876  
##  Median :5.92   Median :3.171   Median : 4.102  
##  Mean   :5.85   Mean   :3.128   Mean   : 3.781  
##  3rd Qu.:6.70   3rd Qu.:3.945   3rd Qu.: 5.546  
##  Max.   :8.81   Max.   :5.975   Max.   : 7.708  
##   Petal.Width     useless           Species  
##  Min.   :0.1   Min.   :2.92   setosa    :50  
##  1st Qu.:0.3   1st Qu.:5.23   versicolor:50  
##  Median :1.3   Median :5.91   virginica :50  
##  Mean   :1.2   Mean   :5.88                  
##  3rd Qu.:1.8   3rd Qu.:6.57                  
##  Max.   :2.5   Max.   :8.11
head(x)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 85         2.980       1.464        5.227         1.5
## 104        5.096       3.044        5.187         1.8
## 30         4.361       2.832        1.861         0.2
## 53         8.125       2.406        5.526         1.5
## 143        6.372       1.565        6.147         1.9
## 142        6.526       3.697        5.708         2.3
##     useless    Species
## 85    5.712 versicolor
## 104   6.569  virginica
## 30    4.299     setosa
## 53    6.124 versicolor
## 143   6.553  virginica
## 142   5.222  virginica

Create training and test data.

train <- x[1:100,]
test <- x[101:150,]

Can we predict Petal.Width using the other variables?

lm() uses a formula interface; see ? lm for a description.

model1 <- lm(Petal.Width ~ Sepal.Length
            + Sepal.Width + Petal.Length + useless,
            data = train)
model1
## 
## Call:
## lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length + 
##     useless, data = train)
## 
## Coefficients:
##  (Intercept)  Sepal.Length   Sepal.Width  Petal.Length  
##     -0.20589       0.01957       0.03406       0.30814  
##      useless  
##      0.00392
coef(model1)
##  (Intercept) Sepal.Length  Sepal.Width Petal.Length 
##    -0.205886     0.019568     0.034062     0.308139 
##      useless 
##     0.003917

The summary shows:

  • Which coefficients are significantly different from 0.
  • R-squared (coefficient of determination): The proportion of the variability of the dependent variable explained by the model. It is better to look at the adjusted R-squared, which is adjusted for the number of independent variables (see the formula below).
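Adjusted R-squared penalizes each additional predictor. With \(n\) observations and \(p\) independent variables, it is computed as \[R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\]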
summary(model1)
## 
## Call:
## lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length + 
##     useless, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0495 -0.2033  0.0074  0.2038  0.8564 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.20589    0.28723   -0.72     0.48    
## Sepal.Length  0.01957    0.03265    0.60     0.55    
## Sepal.Width   0.03406    0.03549    0.96     0.34    
## Petal.Length  0.30814    0.01819   16.94   <2e-16 ***
## useless       0.00392    0.03558    0.11     0.91    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.366 on 95 degrees of freedom
## Multiple R-squared:  0.778,  Adjusted R-squared:  0.769 
## F-statistic: 83.4 on 4 and 95 DF,  p-value: <2e-16

Plotting the model produces diagnostic plots (see ? plot.lm). For example, to check for homoscedasticity (the residuals vs. fitted values scatter plot should produce a close to horizontal line) and whether the residuals are approximately normally distributed (the Q-Q plot should be close to a straight line).

plot(model1, which = 1:2)

8.3 Comparing Nested Models

Here we create a simpler model by using only three predictors.

model2 <- lm(Petal.Width ~ Sepal.Length + 
               Sepal.Width + Petal.Length,
             data = train)
summary(model2)
## 
## Call:
## lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0440 -0.2024  0.0099  0.1998  0.8513 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.1842     0.2076   -0.89     0.38    
## Sepal.Length   0.0199     0.0323    0.62     0.54    
## Sepal.Width    0.0339     0.0353    0.96     0.34    
## Petal.Length   0.3080     0.0180   17.07   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.365 on 96 degrees of freedom
## Multiple R-squared:  0.778,  Adjusted R-squared:  0.771 
## F-statistic:  112 on 3 and 96 DF,  p-value: <2e-16

We can remove the intercept by adding -1 to the formula. Note that R-squared is calculated differently for models without an intercept, so it cannot be compared directly to the R-squared of a model that includes one.

model3 <- lm(Petal.Width ~ Sepal.Length + 
               Sepal.Width + Petal.Length - 1,
             data = train)
summary(model3)
## 
## Call:
## lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length - 
##     1, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0310 -0.1961 -0.0051  0.2264  0.8246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## Sepal.Length -0.00168    0.02122   -0.08     0.94    
## Sepal.Width   0.01859    0.03073    0.61     0.55    
## Petal.Length  0.30666    0.01796   17.07   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.364 on 97 degrees of freedom
## Multiple R-squared:  0.938,  Adjusted R-squared:  0.936 
## F-statistic:  486 on 3 and 97 DF,  p-value: <2e-16

Here is a very simple model:

model4 <- lm(Petal.Width ~ Petal.Length -1,
             data = train)
summary(model4)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length - 1, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9774 -0.1957  0.0078  0.2536  0.8455 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## Petal.Length  0.31586    0.00822    38.4   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.362 on 99 degrees of freedom
## Multiple R-squared:  0.937,  Adjusted R-squared:  0.936 
## F-statistic: 1.47e+03 on 1 and 99 DF,  p-value: <2e-16

Compare the models with ANOVA (null hypothesis: the compared models fit the data equally well). Note: This only works for nested models. Two models are nested only if one model contains all the predictors of the other model.

anova(model1, model2, model3, model4)
## Analysis of Variance Table
## 
## Model 1: Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length + useless
## Model 2: Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length
## Model 3: Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length - 1
## Model 4: Petal.Width ~ Petal.Length - 1
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1     95 12.8                         
## 2     96 12.8 -1   -0.0016 0.01   0.91
## 3     97 12.9 -1   -0.1046 0.78   0.38
## 4     99 13.0 -2   -0.1010 0.38   0.69

Model 1 is not significantly better than model 2, model 2 is not significantly better than model 3, and model 3 is not significantly better than model 4. We therefore use model 4, the simplest model.

8.4 Stepwise Variable Selection

The step() function automatically searches for the model with the smallest AIC (Akaike information criterion).
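For a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\), the AIC is defined as \[\mathrm{AIC} = 2k - 2\ln(\hat{L})\] It trades off goodness of fit against model complexity; smaller values are better.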

s1 <- step(lm(Petal.Width ~ . -Species, data = train))
## Start:  AIC=-195.9
## Petal.Width ~ (Sepal.Length + Sepal.Width + Petal.Length + useless + 
##     Species) - Species
## 
##                Df Sum of Sq  RSS    AIC
## - useless       1       0.0 12.8 -197.9
## - Sepal.Length  1       0.0 12.8 -197.5
## - Sepal.Width   1       0.1 12.9 -196.9
## <none>                      12.8 -195.9
## - Petal.Length  1      38.6 51.3  -58.7
## 
## Step:  AIC=-197.9
## Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length
## 
##                Df Sum of Sq  RSS    AIC
## - Sepal.Length  1       0.1 12.8 -199.5
## - Sepal.Width   1       0.1 12.9 -198.9
## <none>                      12.8 -197.9
## - Petal.Length  1      38.7 51.5  -60.4
## 
## Step:  AIC=-199.5
## Petal.Width ~ Sepal.Width + Petal.Length
## 
##                Df Sum of Sq  RSS    AIC
## - Sepal.Width   1       0.1 12.9 -200.4
## <none>                      12.8 -199.5
## - Petal.Length  1      44.7 57.5  -51.3
## 
## Step:  AIC=-200.4
## Petal.Width ~ Petal.Length
## 
##                Df Sum of Sq  RSS    AIC
## <none>                      12.9 -200.4
## - Petal.Length  1      44.6 57.6  -53.2
summary(s1)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9848 -0.1873  0.0048  0.2466  0.8343 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.0280     0.0743    0.38     0.71    
## Petal.Length   0.3103     0.0169   18.38   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.364 on 98 degrees of freedom
## Multiple R-squared:  0.775,  Adjusted R-squared:  0.773 
## F-statistic:  338 on 1 and 98 DF,  p-value: <2e-16

8.5 Modeling with Interaction Terms

What if two variables are only important together? Interaction terms can be added to the formula: a:b adds only the interaction (the product of the two variables), while a * b is shorthand for a + b + a:b, i.e., the main effects plus the interaction. See ? formula.
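As a quick check, the following minimal sketch (the model names m_star and m_colon are just for illustration) fits the same model both ways:

# a * b is shorthand for a + b + a:b
m_star  <- lm(Petal.Width ~ Sepal.Length * Sepal.Width, data = train)
m_colon <- lm(Petal.Width ~ Sepal.Length + Sepal.Width +
                Sepal.Length:Sepal.Width, data = train)
all.equal(coef(m_star), coef(m_colon))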

model5 <- step(lm(Petal.Width ~ Sepal.Length * 
                    Sepal.Width * Petal.Length,
             data = train))
## Start:  AIC=-196.4
## Petal.Width ~ Sepal.Length * Sepal.Width * Petal.Length
## 
##                                         Df Sum of Sq  RSS
## <none>                                               11.9
## - Sepal.Length:Sepal.Width:Petal.Length  1     0.265 12.2
##                                          AIC
## <none>                                  -196
## - Sepal.Length:Sepal.Width:Petal.Length -196
summary(model5)
## 
## Call:
## lm(formula = Petal.Width ~ Sepal.Length * Sepal.Width * Petal.Length, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1064 -0.1882  0.0238  0.1767  0.8577 
## 
## Coefficients:
##                                       Estimate Std. Error
## (Intercept)                            -2.4207     1.3357
## Sepal.Length                            0.4484     0.2313
## Sepal.Width                             0.5983     0.3845
## Petal.Length                            0.7275     0.2863
## Sepal.Length:Sepal.Width               -0.1115     0.0670
## Sepal.Length:Petal.Length              -0.0833     0.0477
## Sepal.Width:Petal.Length               -0.0941     0.0850
## Sepal.Length:Sepal.Width:Petal.Length   0.0201     0.0141
##                                       t value Pr(>|t|)  
## (Intercept)                             -1.81    0.073 .
## Sepal.Length                             1.94    0.056 .
## Sepal.Width                              1.56    0.123  
## Petal.Length                             2.54    0.013 *
## Sepal.Length:Sepal.Width                -1.66    0.100 .
## Sepal.Length:Petal.Length               -1.75    0.084 .
## Sepal.Width:Petal.Length                -1.11    0.272  
## Sepal.Length:Sepal.Width:Petal.Length    1.43    0.157  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.36 on 92 degrees of freedom
## Multiple R-squared:  0.792,  Adjusted R-squared:  0.777 
## F-statistic: 50.2 on 7 and 92 DF,  p-value: <2e-16
anova(model5, model4)
## Analysis of Variance Table
## 
## Model 1: Petal.Width ~ Sepal.Length * Sepal.Width * Petal.Length
## Model 2: Petal.Width ~ Petal.Length - 1
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1     92 11.9                         
## 2     99 13.0 -7     -1.01 1.11   0.36

Model 5 is not significantly better than model 4.

8.6 Prediction

test[1:5,]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 128        8.017       1.541        3.515         1.8
## 92         5.268       4.064        6.064         1.4
## 50         5.461       4.161        1.117         0.2
## 134        6.055       2.951        4.599         1.5
## 8          4.900       5.096        1.086         0.2
##     useless    Species
## 128   6.110  virginica
## 92    4.938 versicolor
## 50    6.373     setosa
## 134   5.595  virginica
## 8     7.270     setosa
test[1:5,]$Petal.Width
## [1] 1.8 1.4 0.2 1.5 0.2
predict(model4, test[1:5,])
##    128     92     50    134      8 
## 1.1104 1.9155 0.3529 1.4526 0.3429

Calculate the root-mean-square error (RMSE); smaller values are better.
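For \(n\) predictions \(\hat{y}_i\) with true values \(y_i\), it is defined as \[\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}\]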

# root-mean-square error between predicted and true values
RMSE <- function(predicted, true) sqrt(mean((predicted - true)^2))
RMSE(predict(model4, test), test$Petal.Width)
## [1] 0.3874

Compare predicted vs. actual values

plot(test[,"Petal.Width"], predict(model4, test),
  xlim=c(0,3), ylim=c(0,3), 
  xlab = "actual", ylab = "predicted",
  main = "Petal.Width")
abline(0,1, col="red")
cor(test[,"Petal.Width"], predict(model4, test))
## [1] 0.8636

8.7 Using Nominal Variables

Dummy coding is used for factors (i.e., the levels are translated into individual 0-1 variables). The first level of a factor is automatically used as the reference, and the other levels are represented as 0-1 dummy variables called contrasts.

levels(train$Species)
## [1] "setosa"     "versicolor" "virginica"

model.matrix is used internally to create the dummy coding.

head(model.matrix(Petal.Width ~ ., data=train))
##     (Intercept) Sepal.Length Sepal.Width Petal.Length
## 85            1        2.980       1.464        5.227
## 104           1        5.096       3.044        5.187
## 30            1        4.361       2.832        1.861
## 53            1        8.125       2.406        5.526
## 143           1        6.372       1.565        6.147
## 142           1        6.526       3.697        5.708
##     useless Speciesversicolor Speciesvirginica
## 85    5.712                 1                0
## 104   6.569                 0                1
## 30    4.299                 0                0
## 53    6.124                 1                0
## 143   6.553                 0                1
## 142   5.222                 0                1

Note that there is no dummy variable for species setosa, because it is used as the reference level. It is often useful to set the reference level explicitly. A simple way is to use the function relevel to change which level is listed first.
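A minimal sketch (train2 is a hypothetical copy so that the models below are unaffected):

# make versicolor the reference level instead of setosa
train2 <- train
train2$Species <- relevel(train2$Species, ref = "versicolor")
levels(train2$Species)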

model6 <- step(lm(Petal.Width ~ ., data=train))
## Start:  AIC=-308.4
## Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length + useless + 
##     Species
## 
##                Df Sum of Sq   RSS  AIC
## - Sepal.Length  1      0.01  3.99 -310
## - Sepal.Width   1      0.01  3.99 -310
## - useless       1      0.02  4.00 -310
## <none>                       3.98 -308
## - Petal.Length  1      0.47  4.45 -299
## - Species       2      8.78 12.76 -196
## 
## Step:  AIC=-310.2
## Petal.Width ~ Sepal.Width + Petal.Length + useless + Species
## 
##                Df Sum of Sq   RSS  AIC
## - Sepal.Width   1      0.01  4.00 -312
## - useless       1      0.02  4.00 -312
## <none>                       3.99 -310
## - Petal.Length  1      0.49  4.48 -300
## - Species       2      8.82 12.81 -198
## 
## Step:  AIC=-311.9
## Petal.Width ~ Petal.Length + useless + Species
## 
##                Df Sum of Sq   RSS  AIC
## - useless       1      0.02  4.02 -313
## <none>                       4.00 -312
## - Petal.Length  1      0.48  4.48 -302
## - Species       2      8.95 12.95 -198
## 
## Step:  AIC=-313.4
## Petal.Width ~ Petal.Length + Species
## 
##                Df Sum of Sq   RSS  AIC
## <none>                       4.02 -313
## - Petal.Length  1      0.50  4.52 -304
## - Species       2      8.93 12.95 -200
model6
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length + Species, data = train)
## 
## Coefficients:
##       (Intercept)       Petal.Length  Speciesversicolor  
##            0.1597             0.0664             0.8938  
##  Speciesvirginica  
##            1.4903
summary(model6)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length + Species, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7208 -0.1437  0.0005  0.1254  0.5460 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.1597     0.0441    3.62  0.00047 ***
## Petal.Length        0.0664     0.0192    3.45  0.00084 ***
## Speciesversicolor   0.8938     0.0746   11.98  < 2e-16 ***
## Speciesvirginica    1.4903     0.1020   14.61  < 2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.205 on 96 degrees of freedom
## Multiple R-squared:  0.93,   Adjusted R-squared:  0.928 
## F-statistic:  427 on 3 and 96 DF,  p-value: <2e-16
RMSE(predict(model6, test), test$Petal.Width)
## [1] 0.1885
plot(test[,"Petal.Width"], predict(model6, test),
  xlim=c(0,3), ylim=c(0,3), 
  xlab = "actual", ylab = "predicted",
  main = "Petal.Width")
abline(0,1, col="red")
cor(test[,"Petal.Width"], predict(model6, test))
## [1] 0.9696

8.8 Alternative Regression Models

8.8.1 Regression Trees

Many models we use for classification can also perform regression and produce piece-wise constant predictions. For example, CART:

library(rpart)
library(rpart.plot)
model7 <- rpart(Petal.Width ~ ., data = train,
  control = rpart.control(cp = 0.01)) # cp is the complexity parameter
model7
## n= 100 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 100 57.5700 1.2190  
##   2) Species=setosa 32  0.3797 0.2469 *
##   3) Species=versicolor,virginica 68 12.7200 1.6760  
##     6) Species=versicolor 35  1.5350 1.3310 *
##     7) Species=virginica 33  2.6010 2.0420 *
rpart.plot(model7)
RMSE(predict(model7, test), test$Petal.Width)
## [1] 0.182
plot(test[,"Petal.Width"], predict(model7, test),
  xlim = c(0,3), ylim = c(0,3), 
  xlab = "actual", ylab = "predicted",
  main = "Petal.Width")
abline(0,1, col = "red")
cor(test[,"Petal.Width"], predict(model7, test))
## [1] 0.9717

Note: This model is not nested in the linear regression models above, so we cannot use ANOVA to compare the models!

8.8.2 Regularized Regression

LASSO and LARS (least angle regression) try to reduce the number of parameters using a regularization term (see lars in package lars and https://en.wikipedia.org/wiki/Elastic_net_regularization).
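The LASSO adds an \(L_1\) penalty on the coefficients to the least-squares objective: \[\hat{\beta} = \mathrm{argmin}_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1\] where the penalty weight \(\lambda\) controls how strongly coefficients are shrunk toward (and often exactly to) zero.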

library(lars)
## Loaded lars 1.3

Create a design matrix (with dummy variables and interaction terms). lm did this automatically for us, but for this lars implementation we have to do it manually.

x <- model.matrix(~ . + Sepal.Length*Sepal.Width*Petal.Length ,
  data = train[, -4])
head(x)
##     (Intercept) Sepal.Length Sepal.Width Petal.Length
## 85            1        2.980       1.464        5.227
## 104           1        5.096       3.044        5.187
## 30            1        4.361       2.832        1.861
## 53            1        8.125       2.406        5.526
## 143           1        6.372       1.565        6.147
## 142           1        6.526       3.697        5.708
##     useless Speciesversicolor Speciesvirginica
## 85    5.712                 1                0
## 104   6.569                 0                1
## 30    4.299                 0                0
## 53    6.124                 1                0
## 143   6.553                 0                1
## 142   5.222                 0                1
##     Sepal.Length:Sepal.Width Sepal.Length:Petal.Length
## 85                     4.362                    15.578
## 104                   15.511                    26.431
## 30                    12.349                     8.114
## 53                    19.546                    44.900
## 143                    9.972                    39.168
## 142                   24.126                    37.253
##     Sepal.Width:Petal.Length
## 85                     7.650
## 104                   15.787
## 30                     5.269
## 53                    13.294
## 143                    9.620
## 142                   21.103
##     Sepal.Length:Sepal.Width:Petal.Length
## 85                                  22.80
## 104                                 80.45
## 30                                  22.98
## 53                                 108.01
## 143                                 61.30
## 142                                137.72
y <- train[, 4]

model_lars <- lars(x, y)
summary(model_lars)
## LARS/LASSO
## Call: lars(x = x, y = y)
##    Df  Rss      Cp
## 0   1 57.6 1225.20
## 1   2 28.1  549.74
## 2   3 11.6  172.16
## 3   4 10.1  140.94
## 4   5  4.2    5.49
## 5   6  4.1    6.08
## 6   7  4.0    5.14
## 7   8  3.9    6.21
## 8   9  3.9    7.94
## 9  10  3.9    9.87
## 10  9  3.9    7.75
## 11 10  3.9    9.73
## 12 11  3.9   11.58
## 13 10  3.9    9.54
## 14 11  3.9   11.00
model_lars
## 
## Call:
## lars(x = x, y = y)
## R-squared: 0.933 
## Sequence of LASSO moves:
##      Petal.Length Speciesvirginica Speciesversicolor
## Var             4                7                 6
## Step            1                2                 3
##      Sepal.Width:Petal.Length
## Var                        10
## Step                        4
##      Sepal.Length:Sepal.Width:Petal.Length useless
## Var                                     11       5
## Step                                     5       6
##      Sepal.Width Sepal.Length:Petal.Length Sepal.Length
## Var            3                         9            2
## Step           7                         8            9
##      Sepal.Width:Petal.Length Sepal.Width:Petal.Length
## Var                       -10                       10
## Step                       10                       11
##      Sepal.Length:Sepal.Width Sepal.Width Sepal.Width
## Var                         8          -3           3
## Step                       12          13          14
plot(model_lars)

The plot shows how the variables are added to the model (from left to right).

Find the best model using Mallows's Cp statistic (see https://en.wikipedia.org/wiki/Mallows's_Cp).
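One common form of the statistic for a candidate model with \(p\) parameters is \[C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2p\] where \(\mathrm{RSS}_p\) is the residual sum of squares of the candidate model and \(\hat{\sigma}^2\) is the error variance estimated from the full model. Models with small \(C_p\) (close to \(p\)) are preferred.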

plot(model_lars, plottype = "Cp")
best <- which.min(model_lars$Cp)
coef(model_lars, s = best)
##                           (Intercept) 
##                             0.0000000 
##                          Sepal.Length 
##                             0.0000000 
##                           Sepal.Width 
##                             0.0000000 
##                          Petal.Length 
##                             0.0603837 
##                               useless 
##                            -0.0086551 
##                     Speciesversicolor 
##                             0.8463786 
##                      Speciesvirginica 
##                             1.4181850 
##              Sepal.Length:Sepal.Width 
##                             0.0000000 
##             Sepal.Length:Petal.Length 
##                             0.0000000 
##              Sepal.Width:Petal.Length 
##                             0.0029688 
## Sepal.Length:Sepal.Width:Petal.Length 
##                             0.0003151

Make predictions.

x_test <- model.matrix(~ . + Sepal.Length*Sepal.Width*Petal.Length,
  data = test[, -4])
predict(model_lars, x_test[1:5,], s = best)
## $s
## 6 
## 7 
## 
## $fraction
##      6 
## 0.4286 
## 
## $mode
## [1] "step"
## 
## $fit
##    128     92     50    134      8 
## 1.8237 1.5003 0.2505 1.9300 0.2439
test[1:5, ]$Petal.Width
## [1] 1.8 1.4 0.2 1.5 0.2
RMSE(predict(model_lars, x_test, s = best)$fit, test$Petal.Width)
## [1] 0.1907
plot(test[,"Petal.Width"], 
     predict(model_lars, x_test, s = best)$fit,
     xlim=c(0,3), ylim=c(0,3), 
     xlab = "actual", ylab = "predicted",
     main = "Petal.Width")
abline(0,1, col = "red")
cor(test[,"Petal.Width"],
    predict(model_lars, x_test, s = best)$fit)
## [1] 0.9686

8.8.3 Other Types of Regression

  • Robust regression: robust against violations of the assumptions like heteroscedasticity and outliers (lmrob and glmrob in package robustbase)
  • Generalized linear models (glm). An example is logistic regression, discussed in the next chapter.
  • Nonlinear least squares (nls)

8.9 Exercises

We will again use the Palmer penguin data for the exercises.

library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm
##   <chr>   <chr>              <dbl>         <dbl>
## 1 Adelie  Torgersen           39.1          18.7
## 2 Adelie  Torgersen           39.5          17.4
## 3 Adelie  Torgersen           40.3          18  
## 4 Adelie  Torgersen           NA            NA  
## 5 Adelie  Torgersen           36.7          19.3
## 6 Adelie  Torgersen           39.3          20.6
## # ℹ 4 more variables: flipper_length_mm <dbl>,
## #   body_mass_g <dbl>, sex <chr>, year <dbl>

Create an R markdown document that performs the following:

  1. Create a linear regression model to predict the weight of a penguin (body_mass_g).

  2. How high is the R-squared? What does it mean?

  3. Which variables are significant and which are not?

  4. Use stepwise variable selection to remove unnecessary variables.

  5. Predict the weight for the following new penguin:

    new_penguin <- tibble(
      species = factor("Adelie", 
        levels = c("Adelie", "Chinstrap", "Gentoo")),
      island = factor("Dream", 
        levels = c("Biscoe", "Dream", "Torgersen")),
     bill_length_mm = 39.8, 
     bill_depth_mm = 19.1, 
     flipper_length_mm = 184, 
     body_mass_g = NA, 
     sex = factor("male", levels = c("female", "male")), 
     year = 2007
    ) 
    new_penguin
    ## # A tibble: 1 × 8
    ##   species island bill_length_mm bill_depth_mm
    ##   <fct>   <fct>           <dbl>         <dbl>
    ## 1 Adelie  Dream            39.8          19.1
    ## # ℹ 4 more variables: flipper_length_mm <dbl>,
    ## #   body_mass_g <lgl>, sex <fct>, year <dbl>
  6. Create a regression tree. Look at the tree and explain what it does. Then use the regression tree to predict the weight for the above penguin.