
Rohan #3: Deriving the Normal Equation using matrix calculus

Understanding the analytical solution to linear regression.

Rohan Kapur
11 min read · Feb 16, 2016

--

This is the third entry in my journey to extend my knowledge of Artificial Intelligence in the year of 2016. Learn more about my motives in this introduction post.

Before I start exploring new algorithms/practices in Machine Learning & Co., I want to first refine my current knowledge. Part of this involves plugging gaps in content I either failed to understand or research during my endeavors in Artificial Intelligence thus far. Today, I will be addressing the Normal Equation in a regression context. I am not going to explore Machine Learning in depth (but you can learn more about it in my previous two posts), so this article assumes basic comprehension of supervised Machine Learning regression algorithms. It also assumes some background in matrix calculus, but an intuition for both calculus and Linear Algebra separately will suffice.

Today, we try to derive and understand this identity/equation:
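In symbols (Θ is the vector of model weights, X the matrix of training inputs, and y the vector of training outputs, all of which are defined properly below):

Θ = (XᵀX)⁻¹ Xᵀy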

Looks daunting? Keep reading!

Recall that a supervised Machine Learning regression algorithm examines a set of data points in a Cartesian instance space and tries to define a model that “fits” said data set with high accuracy. In this context, a model is any linear or non-linear function.

Data points in red, residuals in gray, model in blue — http://www.pieceofshijiabian.com/wp-content/uploads/2014/01/Screen-Shot-2014-01-18-at-5.50.07-PM.png

As you can see, the blue line captures the trend (that is, how the data points move across the instance space) in this two-dimensional, noisy example. The term “residuals”, primarily used in Statistics and rarely in Machine Learning, may be new for many of you. Residuals, the vertical lines in gray, are the distances between each of the y-coordinates of the data points and their respective predictions on the model. At TV = 0, our model predicted a value of Sales = 6.5 even though the actual value was Sales = 1 (I’m eye-balling this!). Hence, the residual for this instance is 5.5.

We would intuitively prefer a smaller residual over a greater one; however, a model generally cannot accommodate a change in its output for one data point while keeping its output constant for all other data points (and, given the risk of overfitting, we may not want it to). So, instead, we say that we want to minimize the average residual. That is, we want to minimize the summation of each data point’s residual divided by the number of data points. We can denote this as the “cost function” for our model, and the output of said function as the “cost”.

The final step is to evaluate how we use a residual in the cost function (the principal unit of the cost function). Simply summing over the residuals (which are mere differences/subtractions) is not suitable because it may lead to negative values and fails to capture the idea of Cartesian “distance” (which would cause our final cost to be inaccurate). A valid solution would be wrapping an absolute value around our subtraction expression. Another approach is to square the expression. The latter is generally preferred because of its Mathematical convenience when performing differential Calculus.

OK. Let’s approach this Mathematically. First, we will define our model as:
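Writing the model’s prediction on an input x as hθ(x), with n input features x₁ … xₙ and weights θ₀ … θₙ (θ₀ being the constant term):

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ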

This is pretty much y = mx + c but in a high-dimensional space such that x is a vector of values. The TV sales scatter plot diagram above is two-dimensional and hence has a single x: TV; however, this need not always be the case.

Notice that this is the “dot-product” between the vectors Θ and x. We can rewrite this using the conveniences of Linear Algebra notation:
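With Θ and x treated as column vectors (and x prefixed with a 1, as explained below):

hθ(x) = Θᵀx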

The “transpose” operation (which looks like a value raised to the power of “T”) switches the rows and columns of any matrix. Now, we can get a legal multiplication between vectors Θ and x.

Θ stores the real coefficients (or “weights”) of the input features x and is hence of the exact same dimensionality as x. It is important to note that we prefix the vector x with 1 so that we can add support for a constant term in our model.
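Concretely, the two vectors line up as:

x = [1, x₁, x₂, …, xₙ]ᵀ and Θ = [θ₀, θ₁, θ₂, …, θₙ]ᵀ

so that Θᵀx = θ₀ · 1 + θ₁x₁ + … + θₙxₙ, recovering the constant term θ₀.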

Now, we can plug in any value x for TV and expect to get an accurate projection for the number of Sales we plan to make. The only issue is, well, we have to figure out the value(s) of Θ that form an accurate model. As humans, we may be able to guess a few values, but a) this will not apply in higher-dimensional applications, b) a computer cannot do “guess-work”, and c) we want to find the most accurate Θ.

x represents any single data point we enter to receive a projection. We can now define X as a collection of all the inputs for the A-Priori data points that we use to shape our model. Note that each data point is a vector in itself, so X is a large matrix (denoted the “design matrix”) whereby each row (we write an arbitrary data point as x superscript index i) is one vector x. y is, conversely, the vector of outputs for these A-Priori data points. We define m as the number of rows in X, or otherwise the number of A-Priori data points. In Machine Learning this is called a “training set”.
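In symbols, stacking the m input vectors as rows and the m outputs as a column: X is an m × (n + 1) matrix whose i-th row is (x⁽ⁱ⁾)ᵀ, and y is the m × 1 column vector whose i-th entry is y⁽ⁱ⁾.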

I hope it is clear that, here, Θ essentially defines the model. Everything else stays constant. So, to find the most accurate model, we need to find the Θ with the lowest cost (the lowest average residual, passed through our chosen squaring function, as stated earlier). Let’s put our cost function in Mathematical notation:
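At its most general, the cost is the average over the m training examples of a per-example unit built from the residual:

Cost(Θ) = (1/m) · Σᵢ unit(residual⁽ⁱ⁾), summing over i = 1 … m

with the exact form of that unit pinned down in the next few steps.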

The residual is simply the difference between the predicted value (the input values fed through the model) and the actual value:
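For the i-th training example, using the hθ notation from before:

residual⁽ⁱ⁾ = hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ = Θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾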

Earlier we discussed different ways we could interact with the residual to create a principal cost unit. Squared difference is the preferred one:

We square the residuals because of the Mathematical convenience when we use differential Calculus

Inserting this expression in place with the cost function leaves us with:
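Summing the squared residuals over all m training examples and averaging:

Cost(Θ) = (1/2m) · Σᵢ (Θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)², summing over i = 1 … m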

Notice how we’ve multiplied the overall expression by a half. We do this, as per usual, for Mathematical convenience: the half cancels the factor of 2 that appears when we differentiate the squared term later on.

Unfortunately, before we can proceed, we will need to represent the cost function in vectorized (using Linear Algebra) format. Instead of explicitly summing over the square residuals, we can calculate the square residuals with matrix operations.

If we were to collect all the residuals in one vector, it would look somewhat like this:

We can further split this up:

Next:

Now let’s simplify the left vector:

And then:

And bam, we can clearly see that this simplifies to:
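XΘ − y

since each row of X is one (x⁽ⁱ⁾)ᵀ, the matrix-vector product XΘ stacks every prediction Θᵀx⁽ⁱ⁾ into a single column, from which we subtract the column of actual outputs y.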

But we’re forgetting something important; each residual is squared. This poses a problem because we cannot simply square the overall expression XΘ − y. Why? Put simply, the square of a vector/matrix is not equal to the square of each of its inner values. So — what do we actually want to do? We want to take each value in the vector XΘ − y and multiply it by itself. This is what we can formulate:

If we want to get each value in XΘ − y to multiply by itself, we can use a second XΘ − y and have each value in the first vector multiply by the value at the same index in the second vector. This is a dot-product. Recall that when we formulated our hypothesis/model function, we took the dot-product of two vectors as the multiplication between one of the vectors transposed and the other vector. Hence, to achieve square residuals, we end up with the following expression:
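(XΘ − y)ᵀ(XΘ − y)

This is a single scalar: the sum of every squared residual.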

By adding the division term to the front of the expression, we should have the final Linear Algebra/vectorized expression for our cost function:
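Cost(Θ) = (1/2m) · (XΘ − y)ᵀ(XΘ − y)

This is exactly the summation form from before, just written entirely in matrix operations.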

So, now that we’re able to “score” Θ, the end goal is simple:
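Stated as an optimization problem:

Θ* = argmin_Θ Cost(Θ)

where Θ* simply labels the winning choice of weights.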

In other words, find the Θ that minimizes the output of the cost function and hence represents the most accurate model with the lowest average residual.

So, how shall we approach this? I hope it will become clear through this graph:

It’s convex!

That’s right — in Linear Regression one can prove that Cost(Θ) is a convex function! So, at this point, it seems pretty simple: apply differential Calculus to solve for the least cost. Let’s get started!

We are going to differentiate with respect to, as expected, Θ. At this point, we can drop the division term because when we equate the derivative to zero it will be nulled out anyway.

We are yet again going to be simplifying the cost function (RHS), this time using the matrix transpose distribution identity (AB)ᵀ ≡ BᵀAᵀ:

EDIT: I make a mistake here and do (AB)ᵀ = AᵀBᵀ, so I end up having to rearrange (which would otherwise be illegal due to the non-commutative nature of matrix multiplication). This does not matter with respect to determining the final derivative.
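Applied correctly, the transpose distributes as (XΘ − y)ᵀ = (XΘ)ᵀ − yᵀ = ΘᵀXᵀ − yᵀ, so the quantity being differentiated is:

(ΘᵀXᵀ − yᵀ)(XΘ − y)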

At this point, we can expand the two brackets:
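Multiplying the brackets term by term:

ΘᵀXᵀXΘ − ΘᵀXᵀy − yᵀXΘ + yᵀy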

Since y and XΘ are vectors of the same dimensionality, the product of one with the transpose of the other satisfies the commutative property (that is, yᵀ(XΘ) = (XΘ)ᵀy, both being scalars; this is not the case for higher-dimensional matrices in general). Hence, the two middle terms can be collected into one term:
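ΘᵀXᵀXΘ − 2ΘᵀXᵀy + yᵀy

(writing the collected middle term as −2ΘᵀXᵀy; we could equally write −2yᵀXΘ, since the two are the same scalar).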

Let’s try to remove all the instances of transpose (again, using the distribution identities):

We can remove the last term because, when differentiating with respect to Θ, it has no effect: its derivative is zero (since Θ is not involved in it):
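∂/∂Θ [ ΘᵀXᵀXΘ − 2ΘᵀXᵀy ]

The first of these is the “left term” and the second the “right term” referred to below.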

Let us evaluate the left term first:

Note that the X terms are constants with respect to Θ in this partial derivative, so we can “take them out” (using differentiation rules) like so:

Using the following matrix calculus scalar-by-vector identity:
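One standard identity of this kind, for any vector u, is:

∂(uᵀu)/∂u = 2u

which mirrors the familiar scalar rule d(u²)/du = 2u.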

EDIT: I want to add intuition for the rule here. Recall that a vector’s transpose multiplied by the vector itself results in a scalar: the sum of each of its values squared (which is not the same as squaring the vector element-wise). If we take the derivative of this scalar with respect to the original vector, we apply the basic principle which states that we differentiate by each value in the vector separately:

That simplifies to:

So now, using this identity, we can solve the derivative of the left term to be:
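∂(ΘᵀXᵀXΘ)/∂Θ = 2XᵀXΘ

(One can double-check this with the general quadratic-form rule ∂(ΘᵀAΘ)/∂Θ = (A + Aᵀ)Θ, which for the symmetric matrix A = XᵀX gives exactly 2XᵀXΘ.)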

Let’s move on. The right term is much simpler to solve:

Taking the scalars out

Using the following matrix calculus conjugate transpose rule:
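For any constant vector a (here a will be Xᵀy):

∂(aᵀΘ)/∂Θ = ∂(Θᵀa)/∂Θ = a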

EDIT: I hope the intuition behind this identity becomes clear using the same principles in the previous explanation.

We can see that:
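∂(2ΘᵀXᵀy)/∂Θ = 2Xᵀy

(the constant factor of 2 simply carries through).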

Now we can combine the two terms together to formulate the final derivative for the cost function:
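∂Cost(Θ)/∂Θ ∝ 2XᵀXΘ − 2Xᵀy

(the right term keeps the negative sign it had in the expanded cost).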

Again — this excludes the 1/2m division term

Recall that the only turning point (a point where the instantaneous gradient is zero) of a convex function is its global minimum. So, we set the derivative to zero and hence find said minimum point:

By setting the derivative of the cost function to zero, we are finding the Θ that attains the lowest cost

Now, we sub in our computed derivative:
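2XᵀXΘ − 2Xᵀy = 0

(The common factor of 2, like the dropped 1/2m, changes nothing about where the zero sits.)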

With some basic algebra, we’re on our way!:
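XᵀXΘ = Xᵀy

(divide through by 2 and move Xᵀy to the right-hand side).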

Slow down! Since matrix multiplication is not commutative, we can’t isolate Θ by simply “bringing down” the X terms. Instead, we multiply both sides by the inverse of said terms (remember that division is the same as multiplication by the inverse):
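(XᵀX)⁻¹ XᵀXΘ = (XᵀX)⁻¹ Xᵀy

(note that we multiply on the left of both sides, since order matters).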

The inverse of a matrix multiplied by the matrix itself is just the identity, so…:
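Θ = (XᵀX)⁻¹ Xᵀy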

We got it!

Here we have it — finally! This is our solution for Θ where the cost is at its global minimum.

Elephant in the room: why isn’t this preferred over Gradient Descent‽ What an elegant solution this is, you say. Well, here’s the main reason why: computing the seemingly harmless inverse of XᵀX, an n-by-n matrix (where n is the number of input features), is roughly of cubic time complexity with the algorithms used in practice. This means that as the number of features grows, the amount of operations required to compute the final result increases in a cubic trend. If X were rather small, and especially if it had a low value for n (i.e. wasn’t of high dimensions), then using the Normal Equation would be feasible. But for any industrial application with large datasets, the Normal Equation would take an extremely, sometimes nonsensically, long time.
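For the curious, here is a minimal NumPy sketch of the closed-form solve on a synthetic dataset (the variable names and the fake data are purely illustrative; solving the linear system XᵀXΘ = Xᵀy with np.linalg.solve is the usual practice rather than forming the inverse explicitly, though the cost in the number of features is still roughly cubic):

```python
import numpy as np

# Synthetic training set: m examples, n features (illustrative values only).
m, n = 100, 3
rng = np.random.default_rng(0)

# Design matrix X with a column of 1s prefixed for the constant term.
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])

# Fake "true" weights, used only to generate noisy targets y.
true_theta = np.array([4.0, 2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=m)

# Normal Equation: theta = (X^T X)^(-1) X^T y.
# We solve the system (X^T X) theta = X^T y instead of inverting explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # should land close to true_theta
```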

I hope that was rewarding for you. It definitely was for me. This was my first step into matrix calculus, which proved to be a stimulating challenge. Next article will be even more daring: a full Mathematical writeup on backpropagation! See you then!
