
The influence of a data point is a measure of the effect that the point has on our fitted model. Two indicators of influence are the point's residual and its leverage. Leverage grows with the distance between the point's x value and the mean of the x values of the data. The size of the residual is the vertical (y) distance from the point to the fitted line.
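As a rough illustration of these two quantities, the sketch below fits a straight line by least squares and computes each point's residual and leverage. It assumes simple linear regression with one predictor, where the leverage of point i can be written as 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2. The data values are made up for the example and are not taken from the text.

```python
# Illustrative data (made up): the last point lies far from the mean of x.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 7.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares slope and intercept for y = b0 + b1 * x.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual: vertical distance from each point to the fitted line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Leverage in simple linear regression (assumed setting for this sketch):
# it depends only on how far x_i is from the mean of x.
leverages = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

for xi, e, h in zip(x, residuals, leverages):
    print(f"x = {xi:5.1f}  residual = {e:7.3f}  leverage = {h:.3f}")
```

In this setting the leverage never looks at the y values, so a point can have high leverage and still sit close to the fitted line.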

We could remove one point at a time, refit the model, and then measure the difference between the two sets of fitted values. We denote the fitted values obtained with the ith point removed as \( \hat{\boldsymbol{Y}}_{-i} \).

\[ \hat{\boldsymbol{Y}} - \hat{\boldsymbol{Y}}_{-i} \]

The above formula gives the change in the fitted values after removing the ith data point. We can turn this into a squared distance:

\[ (\hat{\boldsymbol{Y}} - \hat{\boldsymbol{Y}}_{-i})^{\intercal}(\hat{\boldsymbol{Y}} - \hat{\boldsymbol{Y}}_{-i}) \]

This is a valid method, but it requires refitting the model once for every data point. Cook derived a scaled version of this distance, known as Cook's distance, that can be computed analytically from a single fit.

\[ \text{Cook's Distance: } D_i = \frac{(\hat{\boldsymbol{Y}} - \hat{\boldsymbol{Y}}_{-i})^{\intercal}(\hat{\boldsymbol{Y}} - \hat{\boldsymbol{Y}}_{-i})}{(p + 1)\hat{\sigma}^2} \]

Here \(p + 1\) is the number of estimated coefficients and \(\hat{\sigma}^2\) is the estimated error variance from the full fit. Cook's distance can be calculated without refitting as:

\[ D_i = \frac{\hat{\epsilon}_{i}^{*2} \, h_{ii}}{(p + 1)(1 - h_{ii})} \]

where \(\hat{\epsilon}_{i}^{*} = \hat{\epsilon}_i / (\hat{\sigma}\sqrt{1 - h_{ii}})\) is the ith standardized residual and \(h_{ii}\) is the ith element of the diagonal of the hat matrix. Recall that the hat matrix is used to calculate the fitted y values.

\[ \hat{\boldsymbol{Y}} = \boldsymbol{H}\boldsymbol{Y} \]

\[ \hat{Y}_1 = h_{11}Y_1 + h_{12}Y_2 + \cdots + h_{1n}Y_n \]

Writing out the fitted values in this way shows that the entries of \(\boldsymbol{H}\) scale the effect each observed response has on the fitted values. The diagonal entries \(h_{ii}\), which weight the contribution of \(Y_i\) to its own fitted value, are the leverages that we talked about earlier.
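To check that the two expressions for Cook's distance agree, the sketch below computes it both ways on a small made-up data set: once by deleting each point and refitting, and once from the closed form using the residuals and leverages of a single fit. It assumes simple linear regression (so p = 1), takes \(\hat{\sigma}^2\) to be the residual sum of squares divided by n - p - 1, and uses the simple-regression leverage formula instead of forming the hat matrix. The helper `fit_line` is a hypothetical name introduced only for this example.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Illustrative data (made up for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 7.0]
n = len(x)
p = 1  # one predictor, so p + 1 = 2 estimated coefficients

# Full fit: fitted values, residuals, and error variance estimate.
b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]
sigma2 = sum(e ** 2 for e in resid) / (n - p - 1)  # assumed estimator

# Leverages h_ii, using the simple-regression form of the hat-matrix diagonal.
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
h = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Closed form: standardized residual squared times h_ii / ((p + 1)(1 - h_ii)).
cooks_closed = []
for e, hii in zip(resid, h):
    std_resid = e / math.sqrt(sigma2 * (1.0 - hii))
    cooks_closed.append(std_resid ** 2 * hii / ((p + 1) * (1.0 - hii)))

# Refit form: drop point i, refit, and predict all n points again.
cooks_refit = []
for i in range(n):
    a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    fitted_minus_i = [a0 + a1 * xj for xj in x]
    sq_dist = sum((f - g) ** 2 for f, g in zip(fitted, fitted_minus_i))
    cooks_refit.append(sq_dist / ((p + 1) * sigma2))

for i, (d_closed, d_refit) in enumerate(zip(cooks_closed, cooks_refit), start=1):
    print(f"point {i}: closed form = {d_closed:.6f}, refit = {d_refit:.6f}")
```

Note that the refit version evaluates the reduced model at all n x values, including the deleted one, and keeps \(\hat{\sigma}^2\) from the full fit. Both choices are needed for the two columns to match.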

Influence

R-Code