Multivariate linear regression

What does multivariate linear regression do?

the same as simple linear regression but with more independent variables (predictors): \(y = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + \cdots + \epsilon\)
\(\epsilon \sim N(0, \sigma)\) - normally distributed with same \(\sigma\)
`linear-’ refers to the regression coefficients \(\beta_i\), nonlinear functions of one or more \(x_i\) can be fitted!
function: lm() (as before)

Multivariate linear regression using `lm()`

Simulate data, independent variables:

x1 = runif(10, 0, 100) 
x2 = runif(10, 10, 200) 
x3 = runif(10, 100, 400)
cor.test(x1,x2)$p.value; cor.test(x1,x3)$p.value; cor.test(x2,x3)$p.value

## [1] 0.2151276

## [1] 0.7329013

## [1] 0.3232681

The \(x_i\) must not be (strongly) correlated!

Let’s simulate some reponse:

y = 3 + 2*x1 + 3*x2 + 1*x3 + rnorm(10, 0, 2)   # simulated response

Let’s see if we can “rediscover” the true coefficients above:

mdata = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
head(mdata)

##          y         x1        x2       x3
## 1 557.7894 75.9815207  51.58790 246.8139
## 2 720.2404 97.3230971  68.20277 319.2263
## 3 597.0158 66.6150526  55.27691 293.9997
## 4 600.0651  0.8985721 114.35953 251.0833
## 5 562.8531 26.4637249 113.62655 167.9717
## 6 331.0584 30.2783098  52.04472 109.3102

lm.res = lm(y ~ x1 + x2 + x3, data = mdata)    # Additive model, Wilkinson-Rogers notation
summary(lm.res)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = mdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5867 -0.9367  0.2019  0.7856  2.2606 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.356328   2.434827     2.2   0.0701 .  
## x1          1.983107   0.019371   102.4 5.85e-11 ***
## x2          2.989270   0.020393   146.6 6.80e-12 ***
## x3          0.999827   0.008325   120.1 2.25e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.737 on 6 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 2.272e+04 on 3 and 6 DF,  p-value: 1.491e-12

Reading the output

Coefficients/Estimates displays the calculated values for the \(\beta_i\)
- in our example, the true values should be resembled
Coefficients/Pr(>|t|) are the p-values for the T-tests
- \(H_0: \; \beta_i = 0\)
- if the p-value for some \(\beta_i\) is low, the corresponding slope is significantly different from 0
- that means that the response actually depends on that predictor
R squared (coefficient of determination) describes how good the fit is. A value close to 1 is good.
The p-value for the F statistic should be small. That means that the fit makes sense, i.e. that the variation by regression is much higher than the variation by (random) error

More possibilities

continuous and categorical predictor (independent) variables possible
interaction between variables

Multivariate linear regression

What does multivariate linear regression do?

Multivariate linear regression using lm()

Reading the output

More possibilities

Multivariate linear regression using `lm()`