CS 450 - Lecture 7

Linear Least Squares

We have more equations than unknowns, and we want to find a solution that minimizes $\lVert Ax - b \rVert_2$. That is, we want to find

$$ x^* = \arg\min_{x \in \mathbb{R}^n} \lVert Ax - b \rVert_2 $$

where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$.

We usually write this as $Ax \cong b$, meaning $x$ is the vector that minimizes the 2-norm of the residual $Ax - b$.

For an optimal solution $x^*$, we have $r \perp \text{span}(A)$ where $r = Ax^* - b$. The residual can only be $0$ if $b \in \text{span}(A)$.
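Writing the orthogonality condition column by column gives $A^T r = 0$, that is,

$$ A^T (A x^* - b) = 0 \quad \Longleftrightarrow \quad A^T A x^* = A^T b, $$

which is exactly the system of normal equations derived later in these notes.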

Given the SVD of $A = U \Sigma V^T$, and since $U^T$ is orthogonal (hence, an isometry), we can write:

$$ \begin{align*} \lVert Ax - b \rVert_2 &= \lVert U^T (Ax - b) \rVert_2 \\[1em] &= \lVert \Sigma V^T x - U^T b \rVert_2 \end{align*} $$

Letting $y = V^T x$ and $d = U^T b$, we need to find $y^*$ such that $\lVert \Sigma y - d \rVert_2$ is minimized.

The minimizer of this reduced problem is given componentwise by:

$$ y_i^* = \begin{cases} d_i / \sigma_i & i = 1, \ldots, r \\ 0 & i = r+1, \ldots, n \end{cases} $$

where $r = \text{rank}(A)$; choosing $y_i^* = 0$ for $i > r$ gives the minimum-norm solution.

Remember that the inverse of a non-singular matrix $A$ is given by $A^{-1} = V \Sigma^{-1} U^T$. Similarly, the pseudoinverse is given by $A^\dagger = V \Sigma^\dagger U^T$, where $\Sigma^\dagger$ is the diagonal matrix with the reciprocals of the non-zero singular values of $A$ on the diagonal, and zeros elsewhere:

$$ \Sigma^\dagger = \text{diag}\left(\dfrac{1}{\sigma_1}, \ldots, \dfrac{1}{\sigma_r}, 0, \ldots, 0\right) $$

Then $y^* = \Sigma^\dagger d$, and since $x = V y$, we can recover $x^*$ as:

$$ x^* = V y^* = V \Sigma^\dagger U^T b = A^\dagger b $$
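As a quick sanity check, here is a minimal NumPy sketch (not from the lecture; the matrix and right-hand side are made-up examples) that forms $x^* = A^\dagger b$ from the SVD and compares it with NumPy's least-squares solver:

import numpy as np

# Made-up overdetermined system: m = 5 equations, n = 3 unknowns
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

# Reduced SVD: U is 5x3, s holds the 3 singular values, Vt is 3x3
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Pseudoinverse solution x* = V Sigma^+ U^T b (all singular values are non-zero here)
x_svd = Vt.T @ ((U.T @ b) / s)

# Compare against NumPy's built-in least-squares solver
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_svd, x_ref))  # True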

Data Fitting using Least Squares

Given a set of points $(x_i, y_i)$, $i = 1, \ldots, m$, we want to find a polynomial $p(x)$ of degree $n-1$ that minimizes the sum of squared residuals:

$$ \sum_{i=1}^m (y_i - p(x_i))^2 $$

The polynomial can be expressed as:

$$ p(x) = a_0 + a_1 x + \ldots + a_{n-1} x^{n-1} $$

We can write this as a linear system:

$$ \begin{bmatrix} 1 & x_1 & x_1^2 & \ldots & x_1^{n-1} \\ 1 & x_2 & x_2^2 & \ldots & x_2^{n-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \ldots & x_m^{n-1} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{n-1} \end{bmatrix} \cong \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} $$

Then we can use the least squares method to find the coefficients $a_i$.
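For instance, a small sketch with made-up data (not from the lecture) that builds the Vandermonde matrix above and solves for the coefficients in the least-squares sense:

import numpy as np

# Made-up data: noisy samples of a quadratic
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.01 * rng.standard_normal(x.size)

n = 3  # fit a polynomial of degree n - 1 = 2
# Vandermonde matrix with columns 1, x, x^2 (increasing powers, as in the system above)
V = np.vander(x, N=n, increasing=True)

# Least-squares solution for the coefficients a_0, ..., a_{n-1}
a = np.linalg.lstsq(V, y, rcond=None)[0]
print(a)  # close to [1, 2, -3]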

Conditioning of Linear Least Squares

Consider a perturbation $\delta b$ to the right-hand side $b$:

$$ A(x + \delta x) \cong b + \delta b $$

The amplification depends on how much of $b$ lies in $\text{span}(A)$, measured by the ratio $\lVert b \rVert_2 / \lVert Ax \rVert_2$:

$$ \frac{\lVert \delta x \rVert_2}{\lVert x \rVert_2} \leq \kappa(A) \frac{\lVert b \rVert_2}{\lVert Ax \rVert_2} \frac{\lVert \delta b \rVert_2}{\lVert b \rVert_2} $$
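A rough numerical illustration of this bound (made-up matrix and perturbation, not part of the notes):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)
db = 1e-8 * rng.standard_normal(20)  # small perturbation of the right-hand side

x = np.linalg.lstsq(A, b, rcond=None)[0]
x_pert = np.linalg.lstsq(A, b + db, rcond=None)[0]

lhs = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
rhs = (np.linalg.cond(A)
       * np.linalg.norm(b) / np.linalg.norm(A @ x)
       * np.linalg.norm(db) / np.linalg.norm(b))
print(lhs <= rhs)  # the observed amplification stays within the bound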

Normal Equations

To solve the least squares problem, we can use the normal equations:

$$ A^T A x = A^T b $$

Using the SVD of $A = U \Sigma V^T$, we have:

$$ \begin{align*} (U \Sigma V^T)^T U \Sigma V^T x &= (U \Sigma V^T)^T b \\ V \Sigma^T U^T U \Sigma V^T x &= V \Sigma^T U^T b \\ \Sigma^T \Sigma V^T x &= \Sigma^T U^T b \\ V^T x &= (\Sigma^T \Sigma)^{-1} \Sigma^T U^T b = \Sigma^\dagger U^T b \\ x &= V \Sigma^\dagger U^T b = x^* \end{align*} $$

(The third line uses $U^T U = I$ and left-multiplies by $V^T$; the fourth assumes $A$ has full column rank, so $\Sigma^T \Sigma$ is invertible and $(\Sigma^T \Sigma)^{-1} \Sigma^T = \Sigma^\dagger$.)

However, this approach is more ill-conditioned than the original least squares problem: in general, $\kappa(A^T A) = \kappa(A)^2$. This is easily seen by noting that the singular values of $A^T A$ are the squares of the singular values of $A$.
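A quick check of this squaring effect on a made-up matrix:

import numpy as np

A = np.random.default_rng(3).standard_normal((50, 4))
print(np.linalg.cond(A) ** 2, np.linalg.cond(A.T @ A))  # nearly equal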

Solving Normal Equations

If $A$ has full column rank ($\text{rank}(A) = n$), then $A^T A$ is symmetric positive definite (SPD), since $x^T A^T A x = \lVert Ax \rVert_2^2 > 0$ for every $x \neq 0$.

Since $A^T A$ is SPD, we can use Cholesky factorization to solve the normal equation $A^T A x = A^T b$.
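A minimal sketch of this approach (made-up data; it assumes SciPy's cho_factor/cho_solve are available):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 3))  # full column rank with probability 1
b = rng.standard_normal(10)

# Form the normal equations A^T A x = A^T b and solve via Cholesky
AtA = A.T @ A
Atb = A.T @ b
c, low = cho_factor(AtA)          # Cholesky factorization of the SPD matrix
x = cho_solve((c, low), Atb)

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True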

QR Factorization

QR factorization provides a more stable way to solve the least squares problem.

Given a matrix $A \in \mathbb{R}^{m \times n}$ where $m \geq n$ and $\text{rank}(A) = n$, we can factor it as $A = QR$, where $Q \in \mathbb{R}^{m \times m}$ is orthogonal and $R \in \mathbb{R}^{m \times n}$ is upper triangular (its last $m - n$ rows are zero).

A reduced QR factorization is defined as $A = \hat{Q} \hat{R}$, where $\hat{Q} \in \mathbb{R}^{m \times n}$ consists of the first $n$ columns of $Q$ (so its columns are orthonormal) and $\hat{R} \in \mathbb{R}^{n \times n}$ is the upper triangular block formed by the first $n$ rows of $R$.

To solve the least squares problem $Ax \cong b$, we can:

  1. Compute the reduced QR factorization $A = \hat{Q} \hat{R}$.
  2. Multiply both sides by $\hat{Q}^T$ to get $\hat{R} x = \hat{Q}^T b$.
  3. Solve the upper triangular system $\hat{R} x = \hat{Q}^T b$ using backward substitution.

This method is more stable than the normal equations, since multiplying by a matrix with orthonormal columns preserves the 2-norm and therefore does not amplify errors.
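A sketch of the three steps above in NumPy (made-up data; it assumes SciPy's solve_triangular for the back substitution):

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

# 1. Reduced QR factorization: Q_hat is 8x3 with orthonormal columns, R_hat is 3x3
Q_hat, R_hat = np.linalg.qr(A, mode='reduced')

# 2.-3. Solve R_hat x = Q_hat^T b by back substitution
x = solve_triangular(R_hat, Q_hat.T @ b, lower=False)

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True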

Gram-Schmidt Orthogonalization

The Gram-Schmidt process finds a set of orthonormal vectors with the same span as the columns of $A$.

We can use it to compute the QR factorization of $A$.

The classical Gram-Schmidt process for QR factorization:

for i = 1,…,n
    b_i  = a_i                # start with original column
    for j = 1,…,i−1
        r_ji = <q_j, a_i>     # projection coefficient
        b_i  = b_i − r_ji q_j # subtract projection
    end
    r_ii = ||b_i||
    q_i  = b_i / r_ii
end

The modified Gram-Schmidt process is more numerically stable:

for i = 1,…,n
    b_i = a_i                 # start with original column
    for j = 1,…,i−1
        r_ji = <q_j, b_i>     # projection coefficient against the CURRENT b_i
        b_i = b_i − r_ji q_j  # subtract projection
    end
    r_ii = ||b_i||
    q_i = b_i / r_ii
end

Here, the projection coefficient $r_{ji}$ is computed against the partially orthogonalized vector $b_i$ rather than the original column $a_i$, which reduces the loss of orthogonality caused by rounding errors.
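A runnable NumPy transcription of both variants (the function names are my own; the loop structure follows the pseudocode above):

import numpy as np

def classical_gram_schmidt(A):
    # Classical GS: projection coefficients use the ORIGINAL column a_i.
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        b = A[:, i].copy()
        for j in range(i):
            R[j, i] = Q[:, j] @ A[:, i]
            b -= R[j, i] * Q[:, j]
        R[i, i] = np.linalg.norm(b)
        Q[:, i] = b / R[i, i]
    return Q, R

def modified_gram_schmidt(A):
    # Modified GS: projection coefficients use the CURRENT, partially orthogonalized b.
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        b = A[:, i].copy()
        for j in range(i):
            R[j, i] = Q[:, j] @ b
            b -= R[j, i] * Q[:, j]
        R[i, i] = np.linalg.norm(b)
        Q[:, i] = b / R[i, i]
    return Q, R

# Quick check on a made-up matrix: both reproduce A and give (nearly) orthonormal Q
A = np.random.default_rng(6).standard_normal((6, 4))
for qr in (classical_gram_schmidt, modified_gram_schmidt):
    Q, R = qr(A)
    print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(4)))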