## What does multivariate linear regression do?

• the same as simple linear regression but with more independent variables (predictors): $$y = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \beta_3 \cdot x_3 + \cdots + \epsilon$$
• $$\epsilon \sim N(0, \sigma)$$: the errors are normally distributed with mean 0 and the same standard deviation $$\sigma$$ for all observations
• “linear” refers to the regression coefficients $$\beta_i$$; nonlinear functions of one or more $$x_i$$ can still be fitted!
• function: lm() (as before)
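The point about linearity in the coefficients can be illustrated with a small sketch. The data below are simulated purely for illustration (the seed, the sample size and the names `x`, `y` and `quad.res` are assumptions, not part of the original example): the response is quadratic in `x`, yet the model is still fitted with lm() because it is linear in the $$\beta_i$$.

```r
set.seed(1)                               # assumed seed, only for reproducibility
x = runif(50, 0, 10)                      # hypothetical predictor
y = 1 + 2*x + 0.5*x^2 + rnorm(50, 0, 1)   # response is a nonlinear function of x
quad.res = lm(y ~ x + I(x^2))             # I() makes ^ mean arithmetic squaring
coef(quad.res)                            # estimates should be close to 1, 2 and 0.5
```

Without I(), the `^` operator would be interpreted as formula syntax (crossing of terms) rather than as squaring.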

## Multivariate linear regression using lm()

Simulate data, independent variables:

x1 = runif(10, 0, 100)
x2 = runif(10, 10, 200)
x3 = runif(10, 100, 400)
cor.test(x1,x2)$p.value; cor.test(x1,x3)$p.value; cor.test(x2,x3)$p.value
## [1] 0.2151276
## [1] 0.7329013
## [1] 0.3232681

The $$x_i$$ must not be (strongly) correlated! Here, none of the three pairwise correlation tests is significant.
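As a complement to the p-values, the correlation coefficients themselves can be inspected. The following is a minimal sketch on separate, hypothetical variables (`z1`, `z2`, `z3` and the seed are assumptions), where `z3` is deliberately constructed from `z1` so that a strong correlation shows up in the matrix:

```r
set.seed(2)                        # assumed seed, only for reproducibility
z1 = runif(10, 0, 100)
z2 = runif(10, 10, 200)            # generated independently of z1
z3 = 2*z1 + rnorm(10, 0, 5)        # strongly correlated with z1 by construction
cmat = cor(cbind(z1, z2, z3))      # matrix of pairwise Pearson correlations
round(cmat, 2)                     # the z1/z3 entry should be close to 1
```

Predictors like `z1` and `z3` should not both enter the same model, because their individual effects can hardly be separated.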

Let’s simulate a response:

y = 3 + 2*x1 + 3*x2 + 1*x3 + rnorm(10, 0, 2)   # simulated response

Let’s see if we can “rediscover” the true coefficients above:

mdata = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
head(mdata)
##          y         x1        x2       x3
## 1 557.7894 75.9815207  51.58790 246.8139
## 2 720.2404 97.3230971  68.20277 319.2263
## 3 597.0158 66.6150526  55.27691 293.9997
## 4 600.0651  0.8985721 114.35953 251.0833
## 5 562.8531 26.4637249 113.62655 167.9717
## 6 331.0584 30.2783098  52.04472 109.3102
lm.res = lm(y ~ x1 + x2 + x3, data = mdata)    # Additive model, Wilkinson-Rogers notation
summary(lm.res)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = mdata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.5867 -0.9367  0.2019  0.7856  2.2606
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.356328   2.434827     2.2   0.0701 .
## x1          1.983107   0.019371   102.4 5.85e-11 ***
## x2          2.989270   0.020393   146.6 6.80e-12 ***
## x3          0.999827   0.008325   120.1 2.25e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.737 on 6 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999
## F-statistic: 2.272e+04 on 3 and 6 DF,  p-value: 1.491e-12

• Coefficients/Estimate shows the estimated values of the $$\beta_i$$
• in our example, the estimates should resemble the true values (3, 2, 3 and 1)
• Coefficients/Pr(>|t|) gives the p-values of the t-tests
• $$H_0: \; \beta_i = 0$$
• if the p-value for some $$\beta_i$$ is low, the corresponding coefficient is significantly different from 0
• that is evidence that the response actually depends on that predictor
• R squared (coefficient of determination) is the fraction of the variance of the response that is explained by the model. A value close to 1 indicates a good fit.

• The p-value for the F statistic should be small. That means that the model as a whole makes sense, i.e. that the variation explained by the regression is much larger than the variation due to (random) error
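The quantities discussed above can also be extracted from the summary object instead of being read off the printout. This is a sketch on freshly simulated data with the same structure as the example (the seed and the names `d`, `fit`, `s`, `f` and `p.f` are assumptions), so the numbers will differ from the output shown above:

```r
set.seed(3)                                    # assumed seed, only for reproducibility
d = data.frame(x1 = runif(10, 0, 100),
               x2 = runif(10, 10, 200),
               x3 = runif(10, 100, 400))
d$y = 3 + 2*d$x1 + 3*d$x2 + 1*d$x3 + rnorm(10, 0, 2)
fit = lm(y ~ x1 + x2 + x3, data = d)
s = summary(fit)
s$coefficients                                 # Estimate, Std. Error, t value, Pr(>|t|)
s$r.squared                                    # coefficient of determination
f = s$fstatistic                               # named vector: value, numdf, dendf
p.f = pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
p.f                                            # p-value of the overall F test
confint(fit)                                   # 95% confidence intervals for the beta_i
```

The summary object does not store the p-value of the F test directly, which is why it is recomputed here with pf().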

## More possibilities

• continuous and categorical predictor (independent) variables possible
• interaction between variables
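Both possibilities can be sketched in one small model. The data are hypothetical (the seed, the group labels and the names `x`, `g`, `slope` and `int.res` are assumptions): a continuous predictor `x` and a categorical predictor `g` interact, i.e. the slope of `x` differs between the groups.

```r
set.seed(4)                                    # assumed seed, only for reproducibility
n = 60
x = runif(n, 0, 10)                            # continuous predictor
g = factor(rep(c("a", "b", "c"), each = n/3))  # categorical predictor with 3 levels
slope = c(a = 1, b = 2, c = 3)[as.character(g)]  # true slope depends on the group
y = 5 + slope * x + rnorm(n, 0, 1)
int.res = lm(y ~ x * g)                        # x * g expands to x + g + x:g
coef(int.res)                                  # x:gb and x:gc are slope differences to group a
anova(int.res)                                 # sequential F tests for x, g and x:g
```

With the default treatment contrasts, group `a` is the reference level, so the coefficient for `x` is the slope in group `a` and the interaction terms should be close to 1 and 2.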