What does linear regression do?


Assumptions


Linear regression using lm()

Some data:

set.seed(777)
x = 1:12
y = 2*x + rnorm(length(x), mean = 0, sd = 1.5)     # linear, with some noise
plot(x, y, main = "Measurements", font.main = 1)

The plot looks fairly linear, so a linear regression can be conducted without any further data manipulation.

Let us create a data frame with the observations first:

dat = data.frame(x = x, y = y)
class(dat)
## [1] "data.frame"
head(dat)
##   x         y
## 1 1  2.734679
## 2 2  3.402188
## 3 3  6.766254
## 4 4  7.401782
## 5 5 12.458029
## 6 6 12.931911

We run the function, telling it that \(y\) is to be treated as a function of \(x\):

lm.out <- lm(y ~ x, data = dat)
summary(lm.out)
## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22999 -0.60183  0.00169  0.27844  2.00234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.74984    0.61946    1.21    0.254    
## x            1.94117    0.08417   23.06 5.31e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.007 on 10 degrees of freedom
## Multiple R-squared:  0.9815, Adjusted R-squared:  0.9797 
## F-statistic: 531.9 on 1 and 10 DF,  p-value: 5.307e-10


Reading the output
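
The coefficient table lists the estimated intercept (0.75, not significantly different from zero) and slope (1.94, close to the true value of 2 used to generate the data), each with its standard error, t value, and p-value. The residual standard error and the R-squared of 0.98 summarize how well the line fits. These quantities can also be extracted from the fitted object directly; a minimal sketch:

coef(lm.out)                  # estimated intercept and slope
confint(lm.out)               # 95% confidence intervals for the coefficients
summary(lm.out)$sigma         # residual standard error
summary(lm.out)$r.squared     # multiple R-squared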


Draw the regression line using the lm() results

# str(lm.out)
alpha = lm.out$coefficients[1]
beta = lm.out$coefficients[2]
plot(x, y)
abline(a = alpha, b = beta, col = "red")   # abline draws a line with slope b and intercept a
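
Note that abline() also accepts the fitted model object, so the same line could have been drawn without extracting the coefficients by hand:

abline(lm.out, col = "red")   # intercept and slope are taken from the lm object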


Checking if the assumptions were fulfilled

Yes! This is done after the fit by looking at the residuals (the deviations of the observed points from the fitted line).
Check homoscedasticity and normality:

# plot(lm.out)  
residuals = resid(lm.out)  # get the residuals 
par(mfrow = c(1, 2))
plot(residuals, col = "blue", main = "Residuals")
abline(h = 0, col = "red", lty = 2)
qqnorm(residuals, col = "blue")
qqline(residuals, col = "red", lty = 2)

par(mfrow = c(1, 1))
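
The graphical checks can be complemented by a formal test of normality, e.g. the Shapiro-Wilk test (an additional check, not part of the original workflow); a large p-value means that normality of the residuals is not rejected:

shapiro.test(resid(lm.out))   # H0: the residuals come from a normal distribution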

Independence of the residuals can be checked using an autocorrelation plot:

auto.cor = acf(resid(lm.out), plot = FALSE)   # compute the autocorrelation function of the residuals
plot(auto.cor, main = "Autocorrelation of the residuals")
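
A formal counterpart to the autocorrelation plot is the Ljung-Box test from base R (again an additional check, not shown in the original example):

Box.test(resid(lm.out), type = "Ljung-Box")   # H0: no autocorrelation in the residuals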

We can also directly have a look at some key data of the residuals:

residuals = resid(lm.out)
sd(residuals)     # standard deviation 
## [1] 0.9596625
summary(residuals)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.229993 -0.601828  0.001691  0.000000  0.278440  2.002337


Why is plotting necessary?

Technically, lm() will work in most cases. But very different datasets can yield practically the same regression line, which only becomes apparent when the data are plotted!
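
A classic illustration is Anscombe's quartet, built into R as the dataset anscombe: its four x-y pairs give nearly identical regression coefficients, yet only a plot reveals how different the datasets really are. A minimal sketch:

par(mfrow = c(2, 2))
for (i in 1:4) {
  xi = anscombe[[paste0("x", i)]]
  yi = anscombe[[paste0("y", i)]]
  plot(xi, yi, col = "blue", pch = 18, main = paste("Anscombe dataset", i))
  abline(lm(yi ~ xi), col = "red")   # intercept ~ 3.0 and slope ~ 0.5 in all four cases
}
par(mfrow = c(1, 1))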


Fitting non-linear functions

No problem!

Fitting a square function

Create some data points:

set.seed(777) 
x = 1:12
y = x^2 + rnorm(length(x), mean = 0, sd = 10)  # square function with some noise added
plot(x, y, col = "red", pch = 18)

It would be possible to define a new variable \(x' = x^2\) and to feed this variable to lm().
However, it is probably easier to use the I() function inside the model formula (Wilkinson-Rogers notation):

dat2 = data.frame(y = y, x = x)
lm.out = lm(y ~ I(x^2), data = dat2)
summary(lm.out)
## 
## Call:
## lm(formula = y ~ I(x^2), data = dat2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.432 -3.687 -0.148  2.010 12.776 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.60544    2.92698   1.573    0.147    
## I(x^2)       0.96020    0.04115  23.334 4.73e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.571 on 10 degrees of freedom
## Multiple R-squared:  0.982,  Adjusted R-squared:  0.9802 
## F-statistic: 544.5 on 1 and 10 DF,  p-value: 4.732e-10

Plot original data and the fitted curve:

txt = "Original data + fit"
plot(x, y, col = "red", pch = 18, main = txt)             # original data
c1 = lm.out$coefficients["(Intercept)"]                   # intercept
c2 = lm.out$coefficients["I(x^2)"]                        # coefficient of x^2
curve(c1 + c2*x^2, from = min(x), to = max(x), n = 101, col = "blue", lwd = 1.5, lty = 2, add = TRUE)

We used the curve() function to plot the fit.
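
Alternatively, the fitted values stored in the model object can be added with lines(); since the x values are already sorted, no reordering is needed:

lines(x, fitted(lm.out), col = "darkgreen", lty = 3)   # fitted values at the original x positions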


Fitting a polynomial function

Create some data points:

set.seed(777) 
x = runif(100, min = -1, max = 5)               # create 100 uniformly distributed numbers between -1 and 5
y = x^3 - 5*x^2 + 2*x - 1 + rnorm(100, 0, 0.5)  # a third order polynomial with some noise added
plot(x, y, col = "red", pch = 18)

Do the fit using the lm-function with W-R-notation:

dat3 = data.frame(y = y, x = x)
fit = lm(y ~ x + I(x^2) + I(x^3))   # x and y are taken from the workspace; data = dat3 could be supplied as well
summary(fit)
## 
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14497 -0.31654 -0.02124  0.30661  1.24284 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.03723    0.07852  -13.21   <2e-16 ***
## x            2.05217    0.10294   19.94   <2e-16 ***
## I(x^2)      -4.97719    0.06695  -74.34   <2e-16 ***
## I(x^3)       0.99243    0.01091   91.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4719 on 96 degrees of freedom
## Multiple R-squared:  0.9915, Adjusted R-squared:  0.9912 
## F-statistic:  3713 on 3 and 96 DF,  p-value: < 2.2e-16
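
The same model can be written more compactly with poly(); with raw = TRUE the coefficients refer to the plain powers of x and therefore match the output above (shown here only as an equivalent alternative):

fit.poly = lm(y ~ poly(x, 3, raw = TRUE), data = dat3)   # equivalent to y ~ x + I(x^2) + I(x^3)
coef(fit.poly)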

Plot original data and the fitted curve:

c1 = fit$coefficients["(Intercept)"]       # intercept
c2 = fit$coefficients["x"]                 # coefficient of x
c3 = fit$coefficients["I(x^2)"]            # coefficient of x^2
c4 = fit$coefficients["I(x^3)"]            # coefficient of x^3

txt = "Original data + fit"
plot(x, y, col = "red", pch = 18, main = txt)
curve(c1 + c2*x + c3 *x^2 + c4*x^3, from = min(x), to = max(x), n = 101, col = "blue", lwd = 1.7, lty = 2, add = TRUE)

Prediction

Above, as a result of the regression, we obtained an object fit. This object represents a statistical model for the relationship between the x- and y-data. We can use this model to predict values at arbitrary positions on the x-axis, provided they are not too far outside the range of the original data.

In order to do so, we can use the function predict. The points intended for prediction must be supplied as a data frame whose column name matches the predictor in the model (here x):

to_predict <- data.frame(x = seq(-2, 3, 0.01))
head(to_predict)
##       x
## 1 -2.00
## 2 -1.99
## 3 -1.98
## 4 -1.97
## 5 -1.96
## 6 -1.95
pdout <- predict(fit, newdata = to_predict, interval = "confidence")
head(pdout)
##         fit       lwr       upr
## 1 -32.98977 -34.02908 -31.95045
## 2 -32.65216 -33.68226 -31.62205
## 3 -32.31673 -33.33768 -31.29578
## 4 -31.98348 -32.99532 -30.97164
## 5 -31.65239 -32.65518 -30.64960
## 6 -31.32347 -32.31725 -30.32968

Here, we conducted a prediction using the model fit at the points defined by to_predict, simultaneously calculating a 95% confidence interval (the confidence level can be changed via the level argument; see the full description of predict). The output of predict has 3 columns if interval = "confidence" has been chosen: the column fit contains the predicted values, while lwr and upr denote the lower and upper confidence limits, respectively.
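
Besides the confidence interval for the fitted curve, predict() can also return a prediction interval for new observations, which additionally accounts for the residual scatter and is therefore wider. A sketch of the call, using the level argument to request 99% coverage:

pi.out <- predict(fit, newdata = to_predict, interval = "prediction", level = 0.99)
head(pi.out)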

Let us plot the original data together with the predicted points and the confidence limits:

txt = "Original data + prediction + CI"
plot(x, y, col = "red", pch = 18, main = txt, xlim = c(-2, 6), ylim = c(min(pdout[,"lwr"]), max(pdout[,"upr"])))   
lines(to_predict[,1], pdout[,"fit"], col = "blue", lwd = 1.8, lty = 2)
lines(to_predict[,1], pdout[,"lwr"], col = "green", lwd = 1.8, lty = 3)
lines(to_predict[,1], pdout[,"upr"], col = "green", lwd = 1.8, lty = 3)

For this example, the confidence limits are very narrow, which can already be seen in the output of the head command above. This is due to the large number of available data points and the weak noise.

Let us try out an example with fewer data points and stronger noise:

Now create only 15 data points and increase the standard deviation of the noise term:

set.seed(888)
x = runif(15, min = -1, max = 5)                      # create 15 uniformly distributed numbers
y = x^3 - 5*x^2 + 2*x - 1 + rnorm(length(x), 0, 2.5)  # a third order polynomial with noise
dat3 = data.frame(y = y, x = x)                       # observed data

We build the model using lm() and make predictions at many intermediate points as well as at some points outside the original range:

fit = lm(y ~ x + I(x^2) + I(x^3))                                     # build the model
to_predict <- data.frame(x = seq(-2, 6, 0.01))                        # points to predict 
pdout <- predict(fit, newdata = to_predict, interval = "confidence")  # prediction 

We plot original data, prediction, and confidence limits:

txt = "Original data + prediction + CI"
plot(x, y, col = "red", pch = 18, main = txt, xlim = c(-2, 6), ylim = c(min(pdout[,"lwr"]), max(pdout[,"upr"])))   
lines(to_predict[,1], pdout[,"fit"], col = "blue", lwd = 1.8, lty = 2)
lines(to_predict[,1], pdout[,"lwr"], col = "green", lwd = 1.8, lty = 3)
lines(to_predict[,1], pdout[,"upr"], col = "green", lwd = 1.8, lty = 3)

We see that the confidence intervals are wider now. They are largest outside the range of the original data and smallest where many observations are located.
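
The widening can be made explicit by plotting the interval width against x (a small additional check based on the objects computed above):

ci.width = pdout[, "upr"] - pdout[, "lwr"]    # width of the confidence interval at each x
plot(to_predict$x, ci.width, type = "l", col = "blue",
     xlab = "x", ylab = "CI width", main = "Width of the confidence interval")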


uwe.menzel@matstat.org