Linear classification using least-squares

Let’s talk about another application of least-squares in statistical learning applications. Say we have some data, and we want to separate them into groups. We have some data that we separate into training and validation sets that we know what groups they should be in. For data in an n-dimensional space (say, the data is defined by n coordinates, or in even simpler terms, has n features), we need to find hyperplanes of dimension n-1 that will SPLIT the space in half, where everything on one side is in one group, and everything on the other side of the hyperplane is in another (Intuitively, for our purposes, you can see 2D Euclidean space can be separated into two parts by a line, which is a 1 dimensional object, and a 3D euclidean space can be split by a plane, which is a 2D object).

So how do we find said hyperplane to divide up our space? The simplest way, and probably the most common way still just because it’s easy for a first-shot attempt at classification and linear regression, is to employ least-squares.

Let us say we have N_{t} training data points of n dimensions, training points each designated at t_{i,j}, for the ith index training point, jth coordinate . These training points belong to two classes, which we’ll classify as the -1 team and the 1 team. Since hyperplanes are represented most easily in equation a^{T}x = b, where where b is a constant and a is an n-dimensional vector of coefficients, our goal here is to solve for the a coefficients and the b constant.

So let’s arrange the problem thusly; we want our A matrix for the least squares problem to contain our data points, such that

A = \begin{bmatrix} t_{1,1} & t_{1,2} & \hdots & t_{1,n} & 1 \\ t_{2,1} & t_{2,2} & \hdots & t_{2,n} & 1 \\ \vdots & & & & \vdots \\ t_{N_{t}, 1} & t_{N_{t}, 2} & \hdots & t_{N_{t}, n} & 1 \end{bmatrix}

So we have our rows of data points stacked on top of each other, with a column of 1s representing a constant offset of b in our hyperplane equation. The b vector is obviously then just going to be a stack of -1s and 1s for whatever class each row data point is matched to.

And then obviously since this is least squares, our resulting \hat{a} vector of coordinates is going to be

\hat{a} = (A^{*}A)^{-1}A^{*}b

We have our hyperplane, now how do we do validation or classification of new data points? Recall that we classified our points using -1 and 1 teams. Adding a 1 to the tail of each data point like we did before to represent the constant b term in the hyperplane equation which we have in our coefficients, we can use our hyperplane equation’s left side, a^{T}x, and simply classify points depending if the result is negative or positive!

The MATLAB code below provides an example of a hyperplane separating two 2D Gaussian populations, and then trying to use that rule to classify data with a higher variance than we expected. A plot with training data points and the separating hyperplane is generated, and then a print out of percentage of correctly classified points is given at the end.

% generate 2 2D pops
ndim = 2;
pop_size = 20;
u1 = [4, 4];
u2 = [-2, -1];
covar_mat = 5.*eye(2);
pop1 = bsxfun( @plus, randn(pop_size, ndim)*chol(covar_mat), u1 );
pop2 = bsxfun( @plus, randn(pop_size, ndim)*chol(covar_mat), u2 );
figure;
plot( pop1(:,1), pop1(:,2), '*b' );
hold on;
plot( pop2(:,1), pop2(:,2), 'xr' );
% find the hyperplane that goes through the middle using least squares
A = [[pop1; pop2], ones(2*pop_size,1)];
b = [-1.*ones( pop_size, 1 ); 1.*ones(pop_size,1)];
coefs = lscov(A, b);
% plot the coefficient line using that learning data we generated
line_x = [-20; 20];
line_y =  (-coefs(end)-coefs(1).*line_x)./coefs(2);
plot( line_x, line_y, '-k' );
xlim( [-20, 20] );
ylim( [-20, 20] );
% now let's try to classify the new points we generate
covar_mat2 = 8.*eye(2);
new_pop1 = bsxfun( @plus, randn(pop_size, ndim)*chol(covar_mat2), u1 );
new_pop2 = bsxfun( @plus, randn(pop_size, ndim)*chol(covar_mat2), u2 );
combined_newpop = [new_pop1; new_pop2];
new_inds = randperm( (2*pop_size) );
actual_population = b(new_inds);
combined_newpop = combined_newpop( new_inds, : );
newpop_w_ones = [combined_newpop, ones(2*pop_size, 1)];
classify = sum( bsxfun( @times, newpop_w_ones, coefs(:)' ), 2);
classify(classify < 0) = -1;
classify(classify >= 0) = 1;
percent_correct = sum(classify == actual_population)./(2*pop_size)

So what’s wrong with this? The clearest problem is that this method really implies that groups are linearly separable, and a lot of times, it’s not that simple. This method also tends to allow outliers to have really heavy effects on the resulting lines, and generally does poorly for classification of more than 2 groups, no matter how you decide to handle the multiple group problem (1v1 hyperplanes or 1 vs everybody else hyperplanes, either way). Inherently, since linear least squares is optimal ML estimator for Gaussian populations, you are inherently assuming multivariable Gaussian distributed stuff, and even if it is, there’s somewhat of a disregard for prior information, say, covariance matrices or anything else that might help.

Since the follow up is natural and I really want to get this stuff nailed down for myself, next time we will talk about linear discriminant analysis, and then after that, quadratic discriminant analysis. I personally think of LDA and QDA as a more thorough treatment of the linear classification problem by stating assumption and taking in account parameters that may help us, plus QDA has the benefit of generating quadratic lines.

Frequency offset synchronization in wireless communication systems

So last time we went over some simple frame synchronization solutions that don’t really pan out when the channel is bad. Frequency offset is a problem, but the self-referential synchronization more or less dealt with that, though it still fell victim to the frequency selective channel.

So for now, let’s assume we have our frame, and we know where it begins. We need to fix any possible frequency offset problems that could exist in the system. It would be awesome if we had perfectly tuned radios that were dead on and there was no need to any frequency offset fixes, but having that kind of accuracy is obviously impossible, especially with more than a couple devices (like imagine if every single cell phone in New York City had to have perfectly synchronized electronics just to received cell phone signal, I don’t even know how that could be possible).

So why is the frequency offset fix so important? Well, let’s think about it for a second. Assuming we have some sort of magic channel that doesn’t mess with our signal at all, no weird channel, no noise, we can receive a sent signal x[n] with a baseband representation at the received end like

y[n] = x[n] \exp (j2 \pi f_{o} n T_{s} )

f_{o} is the frequency offset and T_{s} is the sampling period. Looks pretty harmless, right? But wait, remember for any sort of QAM, M-PSK signaling, we require knowledge of the phase. We can’t afford to lose that, so you can imagine that the samples as you go down in sample index time start having serious phase changes from the exponential bit in the equation above. In effect, we rotate our symbols some amount of phase, and that completely messes everything up.

How do we go about fixing this? There’s a little something called the Moose algorithm that uses our friend least-squares in a sense. Recall the self-referential training. We sent two identical sequences for training and correlated some N_{t} samples with the N_{t} samples ahead of it to find where the frame started. Looking at these training samples, denoted as t…

t[n] \exp (j2 \pi f_{o} n T_{s} )= t[n+N_{t}] \exp( j2 \pi f_{o} (n+N_{t}) T_{s} )

Since we sent the same thing twice, we can say the first N_{t} samples should be the same as the next ones, but with a certain amount of phase change. comparing the two equations and matching things together, it’s clear that we expect \exp( j2 \pi f_{o} N_{t} T_{s} ) amount of phase change between the two sequences. Well, that’s cool, that’s constant in n. We can set up a sort of trivial least-squares problem then…

\begin{bmatrix} t[n] \\ t[n+1] \\ \vdots \\ t[n+N_{t}-1] \end{bmatrix} x = \begin{bmatrix} t[n+N_{t}] \\ t[n+N_{t}+1] \\ \vdots \\ t[n+2N_{t}-1] \end{bmatrix}

Solving for constant x then gives us how much phase difference there was between the first part and second part of training. We represented the phase change in the form  of \exp( j2 \pi f_{o} N_{t} T_{s} ), so we can take the angle of the complex phase change, divide by 2 \pi N_{t} T_{s}, and we can solve for f_{o}. Then we can correct our entire set of samples, including payload, by just multiplying everything by the proper complex exponential \exp (-j2 \pi f_{o} n T_{s} ).

So Moose algorithm is cool, but there’s some caveats to be aware of. First off, just like all estimates, it is definitely affected by noise. This is more or less resolved by just having better signal power, or having more training samples to try and average away the noise in a sense. HOWEVER, recall we recover frequency offset using an estimated phase offset. Because of the way we’re finding frequency, we are limited to being able to correct for frequencies within \frac{\pm 1}{N_{t}T_{s}}, so if we use too many training samples, we can’t recover a strong frequency offset.

The Moose algorithm is used in Schmidl-Cox synchronization and others as “fine frequency offset correction.” There are schemes to correct for more serious frequency offset, and we will talk about one of them when discussing Schmidl-Cox algorithm as a whole next time.

Below is some quick python code I put together to show that we can estimate the frequency offset pretty easily with just some simple least squares code.

# imports go here
import numpy as np
import matplotlib.pyplot as plt

# functions go here
def moose_alg( samples, fs ):
    num_samples = samples.size
    self_ref_size = num_samples/2
    first_half = np.matrix(samples[:self_ref_size])
    second_half = np.matrix(samples[self_ref_size:])
    phase_offset,_,_,_ = np.linalg.lstsq(first_half.transpose(), second_half.transpose() )
    # use phase offset to find frequency
    freq_shift = np.angle(phase_offset)/2/np.pi/(1/fs*self_ref_size)
    return freq_shift

# main thing goes here
if __name__ == '__main__':
    # some params
    freq_offset = 5
    fs = 10000.0
    # establish our barker codes
    bc13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1])
    # let's make our training sequence bc13, flip bc13, bc13, flip bc13
    training = np.hstack( (bc13, bc13[::-1]) )
    training = np.hstack( (training, training))
    # mess up our training using frequency offset and noise
    freq_offset = 5
    n = np.arange( training.size )
    time = n/fs
    freq_off_vec = np.exp(1j*2*np.pi*freq_offset*time)
    received = training*freq_off_vec + np.random.randn(training.size)*.1
    # pass what we have to moose algorithm
    freq_shift = moose_alg( received, fs )
    print freq_shift

Using least-squares for channel estimation pt 2: Python implementation

So I talked about using least-squares to estimate a channel at the receiver of a communication system, given some known training sequence. Now I have code to put it into practice. The code will be posted below, but I’ll start with an explanation of what I have done first up here before delving into the details after the code.

So the code can be broken up as follows. First, I generate a channel of some random values. I generate some Rayleigh fading values using a Gaussian number generator (by the way, I make no claims as to how realistic or unrealistic this channel is, I’m just using it to prove that the estimation deal works. Realistic channels or even existing models for channels can be discussed some other time).

Next, I generate a pseudonoise training sequence of +1 and -1 values that I would send through my communication system. Realistically, I would use a Gold sequence, or a Barker sequence, some special binary sequence that guarantees good properties for synchronization purposes, but I just randomly generate a binary sequence here.

I apply the communication model we discussed before, where the received signal is y[n] = \sum_{l=0}^{L} h[l] x[n-l] + v[n], where h is our discretized channel, x is the sent signal, and v is additive white Gaussian noise.

Now the fun begins. I estimate the channel using least squares exactly as I had modeled in my previous post, creating a Toeplitz matrix using known training values and forming a “b” vector out of actually received values. I use this estimated channel to find an equalizer that best cancels it out by forming a Toeplitz matrix using the estimated channel, and having a Kronecker delta vector for the “b” vector. Finally, I form a Toeplitz matrix out of received values, form a “b” vector out of known training values, and solve for an equalizer directly. Two steps is definitely not better than one when there’s noise to be accounted for.

So here’s the code.

import numpy as np
import matplotlib.pyplot as plt
import scipy.linalg as la

# parameters go here
L = 5 #length of channel
Lf = 8 #length of equalizer
Nt = 22 #length of training sequence
noise_pwr = .01 #AWGN variance
nd = np.floor((L+Lf)/2.0)

# generate channel, random Rayleigh variables
h = np.random.randn(L) + 1j*np.random.randn(L)

# generate pseudonoise training sequence
t = np.random.randn(Nt)
t[t<0] = -1
t[t>=0] = 1

# apply our frequency selection comm model with AWGN
y = np.convolve(h, t) + np.sqrt(noise_pwr)*np.random.randn(L+Nt-1)

# estimate the channel
b = y[L-1:]
col = t[L-1:]
if b.size > col.size:
    append_zeros = b.size - col.size
    col = np.hstack( (col, np.zeros(append_zeros) ))
row = np.flipud(t[0:L])
A = la.toeplitz(col,row)
h_hat,_,_,_ = np.linalg.lstsq(A, b)

# find an equalizer
col = np.hstack( (h_hat, np.zeros(Lf-1)) )
row = np.hstack( (h_hat[0], np.zeros(Lf-1)) )
A = la.toeplitz(col, row)
b = np.zeros(h_hat.size + Lf - 1)
b[nd-1] = 1.0
f,_,_,_ = np.linalg.lstsq(A, b)

# now just solve directly for equalizer
row = np.flipud(y[:nd])
if row.size < Lf:
    row = np.hstack( (row, np.zeros(Lf-row.size)) )
elif row > Lf:
    row = row[:Lf]
b = t
col = y[nd-1:]
if col.size < t.size:
    col = np.hstack( (col,np.zeros(Nt-col.size)) )
elif col.size > t.size:
    col = col[:Nt]
A = la.toeplitz(col,row)
f_d,_,_,_ = np.linalg.lstsq(A,b)

# convolve with original channel to see how our equalizers are
res1 = np.convolve( h, f )
res2 = np.convolve( h, f_d )

# plot the results
plt.figure()
plt.plot( np.arange(h.size), np.abs(h), '-*b', label='actual')
plt.plot( np.arange(h_hat.size), np.abs(h_hat), '-xr',label='estimated')
plt.title( 'channel magnitudes' )
plt.xlabel( 'sample index' )
plt.ylabel( 'magnitude' )
plt.legend()
plt.savefig( 'ls_ch_est_1.png' )

plt.clf()
plt.plot( np.arange(f.size), np.abs(f), '-*b', label='two-step')
plt.plot( np.arange(f_d.size), np.abs(f_d), '-xr',label='direct')
plt.title( 'equalizer magnitudes' )
plt.xlabel( 'sample index' )
plt.ylabel( 'magnitude' )
plt.legend()
plt.savefig( 'ls_ch_est_2.png' )

plt.clf()
plt.plot( np.arange( res1.size), np.abs(res1), '-*b', label='two-step' )
plt.plot( np.arange( res2.size), np.abs(res2), '-xr', label='direct' )
plt.title( 'convolution of channel with equalizers' )
plt.xlabel( 'sample index' )
plt.ylabel( 'magnitude' )
plt.legend()
plt.savefig( 'ls_ch_est_3.png' )

Let’s talk about some of the nuances of why any of this works. All three of the least squares problems are developed such that every problem is overdetermined. I’m going to say again, do NOT do it with an underdetermined system, and if you have a square A matrix, this probably isn’t going to work our well for you. Basically, choose a training sequence that is long enough, and an equalizer that is at least as long as the channel, and shorter than the training sequence by some amount, and you’re probably golden. Realistically, this probably won’t be a problem if you use a preset sequence for channel estimation, like concatenated Barker codes or something.

There are some places where I have if statements for appending zeros or shortening vectors before forming the Toeplitz matrix. Frankly, the matrix created needs to fit the parameters you have on hand, and there may be times where you need to fill in zeros because there’s nothing left in your convolution to express in the linear system.

The number of sequences I choose to delay here is the recommended rule-of-thumb by Dr. John Cioffi at Stanford, like I discussed before. If you change the value to be one more or one less, I’m pretty sure you can get some results that give you a less-than-perfect Kronecker delta for the final plot of channel convolved with equalizer.

Solving for an equalizer in two steps makes the problem easier to visualize, but in general will obtain worse (maybe not by much) results than the direct equalizer solve. The noise effects tend to get worse using two operations.

An example set of plots generated is given below. Remember that the channel and stuff shown here are just magnitudes, so don’t try to visually convolve them, that’s just going to make things confusing.

ls_ch_est_1 ls_ch_est_2 ls_ch_est_3

Using least-squares for channel estimation

This was something I remembered from EE 381V Wireless Communications Lab at UT Austin, taught by Dr. Robert Heath. It’s a nice practical way of showing how useful least-squares is, even for something that doesn’t really suggest its use at first glance.

So let’s say we have a wireless node communicating with another wireless node, using this simple baseband model.

y(t) = \int_{t}^{t+T} h(s) x(t-s) dt + v(t)

where x is what we transmitted, v is a AWGN process, h is the channel in effect with a delay spread of T, and y is the received signal. We sample it, assuming our RF stuff modulates it down nicely, and we satisfy Nyquist so we don’t have aliasing issues,

y[n] = \sum_{l=0}^{L} h[l] x[n-l] + v[n]

So n is our sample index, and m is our discrete convolution dummy variable, L is how long the channel delay spread is in samples (or so we think). So to get our message back at the receiver, assuming noise is small for now, we need to estimate the channel affecting our communications at this very moment. And how do we do that?

If we sent a known training sequence in our transmission, then we can use that to estimate the channel of our system. We can set up a linear system as follows, under the assumption that x is the training sequence t, with N_{t} samples in the training sequence.

\begin{bmatrix} y[L] \\ y[L+1] \\ \vdots \\ y[N_{t}-1] \end{bmatrix} = \begin{bmatrix} t[L] & \hdots & t[0] \\ t[L+1] & \ddots & \vdots \\ \vdots & & \vdots \\ t[N_{t}-1] & \hdots & t[N_{t}-1-L] \end{bmatrix} \begin{bmatrix} a[0] \\ a[1] \\ \vdots \\ a[L] \end{bmatrix}

We can see that we imitated the discrete convolution operator using this Toeplitz matrix. Intuitively, you can see we turned the training around in time, and then lined it up so that it shifts one sample at a time while multiplying and summing, so it is definitely a convolution expressed in linear algebra. Of course, assuming we have enough training samples such that the A matrix is tall, we can solve for h[n] using Least squares.

NOTE: do NOT do channel estimation with a fat matrix. If you recall from another post, we get a solution with a minimum power, and that really doesn’t help us at all here since it’s like saying our channel has several different possible solutions with an overdetermined linear system.

What can we do with this channel? We can solve for a filter that will allow us to cancel out the effects of this channel on our communications, and then use that filter on the rest of our payload. Our goal is to find an equalizer f[n] such that

$latex \sum_{l=0}^{L_{f}} h[l] f[n-l] = \delta[n]$

L_{f} is the length of the equalizer filter we’re making. So we can see we want our filter to cancel out the effects of the channel and give us a Kronecker delta as a result. Realistically, we probably can’t find a great filter that will give us the delta with no time delay, so we can change our criteria such that we solve for, assuming we choose k samples worth of delay time.

\sum_{l=0}^{L_{f}} h[l] f[n-l] = \delta[n-k]

Using this idea, we can set up a similar problem as before, using the Toeplitz matrix for our convolution again.

\begin{bmatrix} h[0] & 0 & \hdots & \hdots \\ h[1] & h[0] & 0 & \hdots \\ \vdots & \ddots & & \\ h[L] & & & \\ 0 & h[L] & \hdots & \\ \vdots & & & \end{bmatrix} \begin{bmatrix} f[0] \\ f[1] \\ \vdots \\ \vdots \\ \vdots \\ f[L_{f}] \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ \vdots \\ 0 \end{bmatrix}

Now, we get a filter we can use in our communication system to recover our original signal using least squares again. But two least-squares calculations is kind of a lot of work, why do two when we can do one?

If we change our formulation such that we want to recover the original training sequence with our received channel-affected training sequence, we can formulate the following problem,

\sum_{l=0}^{L_{f}} f[l] y[n-l] = t[n-k]

And just like that, we can solve for our filter in a direct step using the following linear system

\begin{bmatrix} t[0] \\ t[1] \\ \vdots \\ t[N_{t}-1] \end{bmatrix} = \begin{bmatrix} y[k] & \hdots & y[k-L_{f}] \\ y[k+1] & \ddots & \vdots \\ \vdots & & \vdots \\ y[k+N_{t}-1] & \hdots & y[k+N_{t}-L_{f}] \end{bmatrix} \begin{bmatrix} f[0] \\ f[1] \\ \vdots \\ f[L_{f}] \end{bmatrix}

So how do we choose equalizer length, and number of samples to delay time? Rule for the equalizer in general is longer equalizers gives more accurate results in exchange for longer computation, but at the very least, shoot for being equivalent in length to the actual channel. The number of samples to delay time when solving for the equalizer, you usually need to solve for several equalizers using different delay values and choose the best one, but a rule of thumb from Dr. John Cioffi’s textbook is that you the floor of the average of channel length and equalizer length, i.e., \frac{L_{f}+L}{2}, gives pretty decent results.

Additionally, we ignored noise this entire time. What should we do if we have strong noise? Saving it for another post, but we can have a minimum mean square error equalizer if we know the noise power and covariance ahead of time.

So later post, minimum mean square error filter, and programming implementation of the filter in Python and/or MATLAB.

Polynomial Interpolation using Vandermonde matrix and Least Squares

There’s a lot of instances where we want to try to find an interpolating polynomial for a set of data points. Say, we have a set of data points, and decide we want a piecewise spline interpolation to try to smooth things out and make a guess at a polynomial function describing our data. Using the magic of least squares, we can form a least-squares problem to find coefficients for a polynomial of a given size.

Let’s say we have m data points such that we know x and y values for. We’re going to label the y values y_{1}, y_{2}, ..., y_{m}, and we can place them in a column vector \bar{y} for convenience. We know our polynomial is going to have some order n-1, such that our polynomial is of the form a_{n}x^{n-1} + a_{n-1}x^{n-2} + ... + a_{1}, so we have a total of n coefficients to our polynomial. Let’s set up the coefficients in a column vector \bar{a}^{T} = \begin{bmatrix} a_{1} & a_{2} & \hdots & a_{n} \\ \end{bmatrix} where the T is simply a transpose, NOT a Hermitian transpose or anything. Then, we can set up our matrix X like the following.

\begin{bmatrix} 1 & x_{1} & x_{1}^{2} & \hdots & x_{1}^{n-1} \\ 1 & x_{2} & x_{2}^{2} & \hdots & x_{2}^{n-1} \\ & & \vdots & & \\ 1 & x_{n} & x_{n}^{2} & \hdots & x_{n}^{n-1} \\ \end{bmatrix}

You can see that the matrix we are setting up here is a Vandermonde matrix, with powers increasing from 0 to n-1 as we go from left to right. Each row of our X matrix lines up with the corresponding y value for the data point, and we can see the number of columns of X matches the number of polynomial coefficients we are aiming for. So now, we can pose the problem as a linear system as follows:

X \bar{a} = \bar{y}

Very clearly, if we have more data points than coefficients, this has to be modeled as a least-squares problem. The solution comes out to be

\bar{a} = (X^{*}X)^{-1}X^{*}\bar{y}

And our solution contains all of our polynomial coefficients from a_{1} to a_{n}!

So aside from the downsides of interpolation and whatever, what should we be careful of here? Overfitting the polynomial can make for some very poor solutions that don’t really make any sense in the context of the problem at hand, so in general, doing least squares with a tall Vandermonde matrix for this interpolation problem will get better results than a square Vandermonde or an underdetermined problem. The saying that simpler is better has never rung so true. A more sensitive issue specific to computer calculation is that the Vandermonde matrix built can have some poor conditioning, with larger powers ballooning or shrinking and causing instability of solutions. In those cases, an alternative and numerical stable interpolation method using Lagrange interpolation is preferred.

An example plot, and the code is presented below.

poly_fit_demo

import numpy as np
import matplotlib.pyplot as plt

n = 11
interp_n = 51
start = 0
end = 1
order = 4
matrix_dim = (n, order) #cubic polynomial
x = np.linspace(start, end, n)
y = np.abs( np.random.randn(n))
A = np.fliplr(np.vander(x, matrix_dim[1]))
coefs,_,_,_ = np.linalg.lstsq(A, y)
interp_x = np.linspace(start, end, interp_n)
interp_y = np.zeros(interp_n)
for ind, ix in enumerate(interp_x):
    interp_y[ind] = np.sum(coefs * ix**np.arange(0, order))
plt.figure()
plt.plot(interp_x, interp_y, '-b', label='interp line')
plt.plot(x, y, '*r', label='data points')
plt.xlabel( 'x value' )
plt.ylabel( 'y value' )
plt.title( 'poly interp demo' )
plt.savefig('poly_fit_demo.png')

Calculating Least-Squares on a computer

The least-squares solution to an overdetermined problem Ax = b discussed previously is essentially solving

A^{*}Ax = A^{*}b \\ x = (A^{*}A)^{-1}A^{*}b

where x is a vector of n elements, b is a vector of m elements, and A is a matrix with m rows, and n columns. So we can just put that directly into MATLAB, Python, or anything else and get our answer, right?

Well, unfortunately, thanks to numerical computation on a computer having numbers of limited precision, sometimes calculating least-squares directly gives terrible answers. So I have two more ways to do it using routines that are readily available in most computational environments. Other ways to do least-squares exist (iterative, etc.), but we can look at those later.

If we do QR factorization on A, such that A = QR, where Q is an m x n matrix with orthonormal columns, and R an n x n square matrix upper diagonal matrix (meaning anything below the diagonal elements are zeros), we can rewrite the least squares problem like this.

A^{*}Ax = A^{*}b \\ R^{*}Q^{*}QRx = R^{*}Q^{*}b \\ R^{*}Rx = R^{*}Q^{*}b \\ Rx = Q^{*}b

By using QR factorization (that is NOT Gram-Schmidt based, say a QR factorization process that uses Householder reflections or Givens rotations, an entirely different post later probably), we can get something that is more numerically stable, without being over-the-top in terms of additional computation. Rule of thumb from a professor was that if you can get unitary matrices from a backward stable algorithm, chances are your end product will be more stable.

We can also do least-squares by using the singular value decomposition. So if we factorize our A matrix such that A=USV^{*}, where U is an m x n matrix with orthonormal columns, S is a square matrix with non-negative diagonal entries, and zeroes everywhere else, and V is an n x n unitary matrix, we can turn the least squares problem into the following.

A^{*}Ax = A^{*}b \\ VSU^{*}USV^{*}x = VSU^{*}b \\ VSSV^{*}x = VSU^{*}b \\ SV^{*}x = U^{*}b

This is supposed to obtain the most numerically stable results for the least-squares calculation at the cost of computation resources. Because of this, MATLAB and I believe NumPy do least-squares calculations using QR factorization by default to get the user good results without too much work for the computer. Scripting languages need all the speed they can get, anyways. GNU Scientific Library does least-squares calculation by SVD, at least for the C API in version 1.63.

(It is entirely possible I got some of the dimensions mixed up on these REDUCED form factorizations. Corrections are appreciated.)

Interesting bit about Least-Squares

Let’s talk about the least-square solution to linear systems of equations. Let’s say we have the problem

Ax=b

where A is an m-by-n matrix (m rows, n columns), x is an n-element vector, and b is an m-element vector. This is just a standard form of a linear algebra problem. Let’s also assume A is full-rank.

Now’s let’s say m>n, or A is a tall matrix. We call A an overdetermined matrix. Simply put, there are way too many unique linear equations to solve for a unique x vector answer. So what do we do? We can use least-squares formulation of the problem to get some sort of an acceptable answer. If we multiply by sides of the equation by the Hermitian transpose of A, we get

A^{*}Ax = A^{*}b \\ x = (A^{*}A)^{-1}A^{*}b

So we have a least-squares solution for x. For people well-versed in linear algebra concepts, this is essentially the projection of the vector b on the subspace spanned by matrix A. But generally speaking, most people don’t care about that. Why is least-squares good for this case then? The least squares solution we obtained here minimizes the function

\|Ax-b\|^{2}

So the vector x minimizes the error for the equation. We cannot obtain an exact x for our overdetermined system, but this is our best shot, in terms of this squared error measure.

But wait! If n>m, or A is a fat matrix, A is then an underdetermined matrix, and we have a large solution space. The vector x could be a whole number of things! What’s neat about least squares is that we can still use it here for a similar purpose.

x = A^{*}(AA^{*})^{-1}b

So why is this useful? The vector x here is actually the solution to the system with the smallest norm, so \|x\|^{2} is minimized. There are places in optimization or control theory where we want the minimum energy vectors for controllability or stability, so the underdetermined least-squares problem is useful as well.