Influence

The influence of a data point measures the effect that the point has on our fitted model. Two indicators of influence are the residual and the leverage of the point. Leverage measures how far the x value of the point lies from the mean of the x values in the data. The residual is the vertical (y) distance from the point to the fitted line.
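As a minimal sketch of these two quantities, the snippet below fits a simple regression and extracts the leverage and residual of every point. The built-in `cars` data set and the names `fit`, `lev`, `res` and `lev_by_hand` are illustrative assumptions, not part of the original analysis. For a single predictor the leverage can also be written directly in terms of the distance of each x value from the mean, which is what the last lines check.

```r
# Illustrative only: `cars` stands in for the data used in these notes.
fit <- lm(dist ~ speed, data = cars)

lev <- hatvalues(fit)   # leverage of each point
res <- residuals(fit)   # y distance from each point to the fitted line

# With one predictor, leverage depends only on how far x is from its mean:
# h_ii = 1/n + (x_i - mean(x))^2 / sum((x_j - mean(x))^2)
x <- cars$speed
lev_by_hand <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

# TRUE if the two calculations agree
all.equal(unname(lev), lev_by_hand)
```

Points far from the mean of x have high leverage whether or not their residual is large, which is why the two quantities are looked at together.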

We could remove one point at a time, refit the model, and then measure the difference between the two sets of fitted values. We denote the fitted values from the model with the ith point removed as \( \hat{\boldsymbol{Y}}_{-i} \).

\[ \hat{\boldsymbol{Y}} - \hat{\boldsymbol{Y}}_{-i} \]

The formula above gives the difference in the fitted values after removing the ith data point. We can turn this into a squared distance:

\[ (\hat{\boldsymbol{Y}} - \hat{\boldsymbol{Y}}_{-i})^{\intercal}(\hat{\boldsymbol{Y}} - \hat{\boldsymbol{Y}}_{-i}) \]

This is a valid method; however, Cook derived an analytical result that gives a similar measurement, called Cook's distance, without refitting the model. Cook's distance scales the squared distance by the number of estimated coefficients, \(p + 1\), and the estimated error variance, \(\hat{\sigma}^2\):

\[ \text{Cook's Distance: } D_i = \frac{(\hat{\boldsymbol{Y}} - \hat{\boldsymbol{Y}}_{-i})^{\intercal}(\hat{\boldsymbol{Y}} - \hat{\boldsymbol{Y}}_{-i})}{(p + 1)\hat{\sigma}^2} \]

Cook's distance can be calculated from the full fit as:

\[ D_i = \frac{(\hat{\epsilon}^{*}_{i})^{2} \, h_{ii}}{(p + 1)(1 - h_{ii})} \]

where \(\hat{\epsilon}^{*}_{i}\) is the standardised residual of the ith point and \(h_{ii}\) is the ith element of the diagonal of the hat matrix. Recall that the hat matrix is used to calculate the fitted y values:

\[ \hat{\boldsymbol{Y}} = \boldsymbol{H}\boldsymbol{Y} \]

\[ \hat{Y}_1 = H_{11}Y_1 + H_{12}Y_2 + \dots + H_{1n}Y_n \]

Writing out the fitted values in this way shows that the entries of \(\boldsymbol{H}\) scale the effect each observed response has on each fitted value. The diagonal entries \(h_{ii}\), which scale the effect of \(Y_i\) on its own fitted value, are the leverages that we talked about earlier.
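The equivalence between the refitting approach and the formula based on the hat matrix can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the built-in `cars` data set rather than the data from these notes, and the names `d_refit` and `d_formula` are hypothetical. It also compares both results with R's own `cooks.distance()`.

```r
# Illustrative only: `cars` stands in for the data used in these notes.
fit    <- lm(dist ~ speed, data = cars)
n      <- nrow(cars)
k      <- length(coef(fit))        # p + 1 estimated coefficients
sigma2 <- summary(fit)$sigma^2     # estimated error variance from the full fit
y_hat  <- fitted(fit)

# Refitting approach: drop point i, refit, predict all n points, and scale
# the squared distance between the two sets of fitted values.
d_refit <- sapply(seq_len(n), function(i) {
  fit_i   <- lm(dist ~ speed, data = cars[-i, ])
  y_hat_i <- predict(fit_i, newdata = cars)
  sum((y_hat - y_hat_i)^2) / (k * sigma2)
})

# Analytical approach: only the standardised residuals and leverages
# of the full fit are needed.
h <- hatvalues(fit)
r <- rstandard(fit)
d_formula <- r^2 * h / (k * (1 - h))

# Both comparisons should return TRUE
all.equal(d_refit, unname(d_formula))
all.equal(d_formula, cooks.distance(fit))
```

The refitting loop fits n separate models, whereas the analytical version reuses quantities from the single full fit, which is the practical reason to prefer it.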
        

R-Code

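library(tidyverse)

# Response and design matrix (assumes `data` has a column `x` and the response in column 3)
y <- data[, 3]
X <- model.matrix(y ~ x, data = data)
n <- nrow(X)

# Two equivalent ways to form X'X
XtX1 <- t(X) %*% X
XtX2 <- crossprod(X, X)
XtX <- XtX2
Xty <- crossprod(X, y)

# Hat matrix H = X (X'X)^{-1} X'
H <- X %*% solve(XtX, t(X))

# Standardised residual at which Cook's distance equals `level`, for a given leverage
cd_cont_pos <- function(leverage, level, model) {
  sqrt(level * length(coef(model)) * (1 - leverage) / leverage)
}
cd_cont_neg <- function(leverage, level, model) {
  -cd_cont_pos(leverage, level, model)
}

# Standardised residuals against leverage, with Cook's distance contours at 4/n.
# Assumes `fit3.lm` is the fitted model and `data` has columns `lev` and `stdRes`.
ggplot(data = data) +
  geom_point(aes(x = lev, y = stdRes)) +
  stat_function(fun = cd_cont_pos, args = list(level = 4 / n, model = fit3.lm)) +
  stat_function(fun = cd_cont_neg, args = list(level = 4 / n, model = fit3.lm))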
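The plot above uses the columns `lev` and `stdRes` without showing where they come from. One plausible way to obtain them from the hat matrix is sketched below; this is an assumption about the missing step, not the original code, and it again uses the built-in `cars` data set with a hypothetical model `fit` in place of `data` and `fit3.lm`.

```r
# Illustrative only: `cars` and `fit` stand in for `data` and `fit3.lm`.
X   <- model.matrix(dist ~ speed, data = cars)
y   <- cars$dist
fit <- lm(dist ~ speed, data = cars)

# Hat matrix, built the same way as in the code above
H <- X %*% solve(crossprod(X), t(X))

lev   <- diag(H)          # leverages are the diagonal of H
y_hat <- drop(H %*% y)    # fitted values, Y-hat = H Y

# Standardised residuals: raw residuals scaled by sigma-hat * sqrt(1 - h_ii)
sigma_hat <- summary(fit)$sigma
stdRes    <- (y - y_hat) / (sigma_hat * sqrt(1 - lev))

# Each comparison with R's built-in helpers should return TRUE
all.equal(unname(lev), unname(hatvalues(fit)))
all.equal(unname(y_hat), unname(fitted(fit)))
all.equal(unname(stdRes), unname(rstandard(fit)))
```

In practice `hatvalues()` and `rstandard()` give the same quantities directly from the fitted model; building them from `H` only makes the link to the formulas explicit.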