A Linear Regression

André — August 28, 2024

Linear regression is the most fundamental technique used in econometrics, particularly because of its high level of ‘interpretability’.

Econometric modeling attempts to estimate the relationship between a set of independent variables, or features, denoted by $X = (X_{1}, X_{2}, \dots, X_{p})$, and a response variable $y$. The $p + 1$ parameters $\beta = (\beta_{0}, \beta_{1}, \dots, \beta_{p})$ restrict the set of possible relationships to those that are linear. The econometrician chooses the features to include and the hypothesized functional form of the model, for either inference or prediction purposes. The data are assumed to be a realization of an unobservable data-generating process (DGP) about which we know nothing but the probability distribution of its constituents.

Simple Linear Regression

The simple regression model is used to study the relationship between a single independent variable and the response. The population model is assumed to have the following functional form:

$$y_{i} = \beta_{0} + \beta_{1} x_{i} + \epsilon_{i}$$

where $\beta_{0}$ is the intercept of the model, $\beta_{1}$ measures the average change in the dependent variable associated with a one-unit change in the value of the regressor, and the error is represented by $\epsilon_{i}$. The data do not usually lie on a perfect line; the error term $\epsilon_{i}$ represents the deviations from the linear DGP, which are assumed to follow a Normal distribution. This error term, however, is more than simple noise when it comes to estimation. Most estimation methods for the linear regression aim to minimize a function of the errors. In this article, we focus on the Ordinary Least Squares (OLS) method, which minimizes the residual sum of squares (RSS) defined as

$$\begin{split} RSS &= \sum_{i = 1}^{N} e_{i}^{2} \\ &= \sum_{i = 1}^{N} (y_{i} - \beta_{0} - \beta_{1} x_{i})^{2} \end{split}$$

The Ordinary Least Squares method was developed in an era when computational methods were not powerful enough to minimize the least absolute deviations. The least squares problem is also guaranteed to have a unique solution as long as the regressors are not perfectly collinear, which requires the number of observations to be at least as large as the number of parameters.

The OLS estimators for the intercept and the slope are defined as the values of $\beta_{0}$ and $\beta_{1}$ at which $RSS(\beta_{0}, \beta_{1})$ attains its minimum.

$$\begin{split} \hat{\beta}_{0}, \hat{\beta}_{1} &= \text{argmin}_{\beta_{0}, \beta_{1}} RSS(\beta_{0}, \beta_{1}) \\ &= \text{argmin}_{\beta_{0}, \beta_{1}} \sum_{i = 1}^{N} (y_{i} - \beta_{0} - \beta_{1} x_{i})^{2} \end{split}$$

The solution for the first coefficient is found by setting the first derivative of the residual sum of squares with respect to $\beta_{0}$ equal to zero.

$$\begin{split} \frac{d RSS(\beta_{0}, \beta_{1})}{d \beta_{0}} &= - 2 \sum_{i = 1}^{N} (y_{i} - \beta_{0} - \beta_{1} x_{i}) \\ 0 &= \sum_{i = 1}^{N} y_{i} - n \beta_{0} - \beta_{1} \sum_{i = 1}^{N} x_{i} \end{split}$$

The solution is then:

$$\begin{split} \beta_{0} &= \frac{1}{n} \sum_{i = 1}^{N} y_{i} - \frac{1}{n} \beta_{1} \sum_{i = 1}^{N} x_{i} \\ &= \bar{y} - \beta_{1} \bar{x} \end{split}$$

The derivation of the estimator for $\beta_{1}$ follows the same steps as the previous one: take the derivative and set it equal to $0$.

$$\begin{split} \frac{d RSS(\beta_{0}, \beta_{1})}{d \beta_{1}} &= \sum_{i = 1}^{N} -2 x_{i} (y_{i} - \beta_{0} - \beta_{1} x_{i}) \\ 0 &= -\sum_{i = 1}^{N} x_{i} y_{i} + \beta_{0} \sum_{i = 1}^{N} x_{i} + \beta_{1} \sum_{i = 1}^{N} x_{i}^{2} \end{split}$$

The solution for the intercept is substituted into the equation,

$$\begin{split} \beta_{1} \sum_{i = 1}^{N} x_{i}^{2} &= \sum_{i = 1}^{N} x_{i} y_{i} - (\bar{y} - \beta_{1} \bar{x}) \sum_{i = 1}^{N} x_{i} \\ &= \sum_{i = 1}^{N} x_{i} y_{i} - \bar{y} \sum_{i = 1}^{N} x_{i} + \beta_{1} \bar{x} \sum_{i = 1}^{N} x_{i} \\ \beta_{1} \sum_{i = 1}^{N} x_{i}^{2} - \beta_{1} \bar{x} \sum_{i = 1}^{N} x_{i} &= \sum_{i = 1}^{N} x_{i} y_{i} - \bar{y} \sum_{i = 1}^{N} x_{i} \end{split}$$

resulting in the following solution for the slope coefficient:

$$\beta_{1} = \frac{\sum_{i = 1}^{N} x_{i} y_{i} - \bar{y} \sum_{i = 1}^{N} x_{i}}{\sum_{i = 1}^{N} x_{i}^{2} - \bar{x} \sum_{i = 1}^{N} x_{i}}$$
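
In Julia, these two closed-form expressions translate directly into code. The snippet below is a minimal sketch; the simple_ols name and the simulated data are illustrative, not part of the original post.

using Statistics

# Closed-form OLS estimators for the simple regression derived above
function simple_ols(x::Vector, y::Vector)
    b1 = (sum(x .* y) - mean(y) * sum(x)) / (sum(x .^ 2) - mean(x) * sum(x))
    b0 = mean(y) - b1 * mean(x)
    return b0, b1
end

# Illustrative data simulated from y = 2 + 3x + noise
x = randn(100)
y = 2 .+ 3 .* x .+ 0.5 .* randn(100)
b0, b1 = simple_ols(x, y)   # estimates should be close to 2 and 3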

Multivariate Linear Regression

In the multivariate case, the model is better represented using matrix notation:

$$y = X \beta + \epsilon$$

where $y$ is a vector of size $n$, $X$ is an $n \times m$ matrix, $\beta$ is an $m$-vector, and $\epsilon$ is another vector of size $n$. Here $m$ is simply the number of independent variables $p$ plus one, so the matrix $X$ contains the $p$ regressors and a vector of ones to estimate the intercept. In matrix notation, the coefficients are again estimated by minimizing the sum of squared residuals $e'e = (y - X \beta)'(y - X \beta)$ with respect to $\beta$.

$$\begin{split} \phi (\beta) &= (y - X \beta)' (y - X \beta) \\ &= y'y - y'X \beta - \beta' X'y + \beta' X'X \beta \\ &= y'y - 2 \beta' X' y + \beta' X' X \beta \end{split}$$

To find the minimum, the derivative of $\phi(\beta)$ with respect to $\beta$ must be set equal to $0$, using the matrix calculus identities $\frac{d}{d\beta}(\beta' a) = a$ and $\frac{d}{d\beta}(\beta' A \beta) = 2 A \beta$ for a symmetric matrix $A$.

$$\begin{split} \frac{d \phi(\beta)}{d \beta} = -2X'y + 2 X'X \beta &= 0 \\ 2X'X \beta &= 2X'y \\ \beta &= (X'X)^{-1}X'y \end{split}$$

The LinearAlgebra package makes it easy to manipulate vectors and matrices in Julia. Once the package is loaded and the variables are defined, the next step is to create a function that takes two arguments: the response variable $y$ and the matrix $X$ of regressors, with a vector of ones as its first column. The coefficients are calculated according to the least squares formula derived above.

using LinearAlgebra

# OLS coefficients from the closed-form solution: beta = (X'X)^(-1) X'y
function lm(y::Vector, X::Matrix)
    beta = inv(X'X) * X' * y
    return beta
end

In the previous algorithm, the matrix $X$ is assumed to include a vector of ones as the first column, and the matrix of regressors in the next columns. The Ordinary Least Squares regression does not require the presence of an intercept to estimate the remaining coefficients.

$$X = \begin{bmatrix} 1 & x_{11} & \dots & x_{1p} \\ 1 & x_{21} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{bmatrix}$$
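
In practice, building such a matrix and calling the function could look like the following sketch; the variable names and the simulated data are illustrative, not part of the original post.

# Two hypothetical regressors plus a column of ones for the intercept
n = 500
x1 = randn(n)
x2 = randn(n)
y = 5 .+ 2 .* x1 .- 1 .* x2 .+ randn(n)

X = hcat(ones(n), x1, x2)   # n × 3 design matrix
beta = lm(y, X)             # estimates roughly [5, 2, -1]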

It is common practice, however, to include an intercept in the regression. The significance of its coefficient will tell us whether it is useful to keep the intercept in the model. The significance of a coefficient in a linear regression can be tested with a t-statistic, which has the following form:

$$t_{\hat{\beta}} = \frac{\hat{\beta} - \beta}{\text{se}(\hat{\beta})} \qquad \text{where } H_{0}: \beta = 0$$

The null hypothesis presented above is the default in most statistical packages and simply tests whether the coefficient is significantly different from zero. The test can easily be adapted to any other null hypothesis about a coefficient.

# t-statistics for H0: beta = 0
Tstat = (beta .- 0) ./ betaSE
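
As an illustration, testing a different null hypothesis, say $H_{0}: \beta = 1$, only changes the value subtracted in the numerator (this line is my own addition, not part of the original code):

# t-statistics for H0: beta = 1
Tstat_h1 = (beta .- 1) ./ betaSE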

I will not present many of the derivations because I do not know or understand them, yet. The formula for $\text{Var}(\hat{\beta})$ is taken from Cross Validated.

The formula of the t-statistic includes a term that has not yet been defined, the standard error of the coefficients $\text{se}(\hat{\beta})$. First of all, the standard error is simply the square root of the variance of the estimated coefficients. The variance of $\hat{\beta}$ depends on the variance of the error term and the inverse of the $X'X$ matrix.

$$\text{Var}(\hat{\beta}) = \hat{\sigma}^{2}_{\epsilon} (X'X)^{-1}$$

The estimated variance of the error term is defined as

$$\hat{\sigma}^{2}_{\epsilon} = \frac{RSS}{n - k}$$

where the residual sum of squares is divided by the difference between the number of observations (the number of rows of $X$) and the number of regressors plus $1$ (the number of columns of $X$).

    # Standard errors of the coefficients (inside lm())
    n = length(y)        # number of observations
    k = size(X, 2)       # number of columns of X (regressors + intercept)
    e = y - (X * beta)   # residuals

    s2 = e'e / (n - k)                     # estimated error variance
    betaSE = sqrt.(diag(inv(X'X) * s2))    # square roots of the diagonal of Var(beta)

The t-statistic is useful when it can be associated with a probability. The plausibility of the hypothesized coefficient $\beta$ can be assessed using Student's t-distribution. The probability of interest is $P(x > \lvert t_{\beta} \rvert)$, that is, the area beyond the absolute value of the t-statistic. In Julia, the algorithm to compute those probabilities can be created using the Distributions package.

using Distributions

# Upper-tail probability P(x > |T|) for a Student's t-distribution with df degrees of freedom
function pt(T, df = 100)
    dist = TDist(df)
    Pvalue = 1 - cdf(dist, abs(T))
    return Pvalue
end
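
As a quick, illustrative sanity check (values not from the original post), a t-statistic of about 2 with roughly 100 degrees of freedom leaves a tail area in the neighborhood of 0.02:

pt(2.0, 100)   # ≈ 0.02, the area to the right of |t| = 2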

Once the pt() function is created, the significance of each coefficient is calculated by adding this next line to the original lm() function. The null hypothesis $H_{0}: \beta = 0$ states that no actual relationship exists between the dependent variable and the regressor of interest; the lower the p-value, the more confidently we can reject this statement.

# p-values for each coefficient, using n - k degrees of freedom
pvalue = [pt(t, n - k) for t in Tstat]

Once all the aforementioned elements are defined, the function returns the vectors of coefficients, standard errors, t-statistics, and p-values in a DataFrame. As an example, the output of a linear regression of two random variables looks as follows:

 Row │ beta     betaSE     tstat    pvalue
     │ Float64  Float64    Float64  Float64
─────┼──────────────────────────────────────
   1 │ 4.9808   0.0642877  77.4767      0.0
   2 │ 1.96143  0.0313061  62.6535      0.0
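
Putting the pieces together, a minimal version of the full function might look like the sketch below. The name lm_full and the DataFrame assembly are my own reconstruction (assuming the DataFrames package); the original post's lm() may differ in details.

using LinearAlgebra, Distributions, DataFrames

# OLS with standard errors, t-statistics, and p-values, assembled from the steps above
function lm_full(y::Vector, X::Matrix)
    beta = inv(X'X) * X' * y
    n = length(y)
    k = size(X, 2)
    e = y - X * beta
    s2 = e'e / (n - k)                     # estimated error variance
    betaSE = sqrt.(diag(inv(X'X) * s2))    # standard errors
    tstat = beta ./ betaSE
    pvalue = [pt(t, n - k) for t in tstat]
    return DataFrame(beta = beta, betaSE = betaSE, tstat = tstat, pvalue = pvalue)
end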

The complete algorithm can be found on GitHub. The algorithm is, of course, much simpler than the actual lm() functions of R, Python, or Julia.
