Rohan #5: What are bias units?

Rohan Kapur
A Year of Artificial Intelligence
5 min read · Apr 8, 2016

This is the fifth entry in my journey to extend my knowledge of Artificial Intelligence in the year of 2016. Learn more about my motives in this introduction post.

This post assumes sound knowledge of neural networks. If that does not describe your current understanding, no worries. You can catch up here.

It’s almost been a month since my last article on A Year of Artificial Intelligence. Why? It’s quite simple — I’m preparing for some of the most important examinations in my high school student life thus far. These results will largely contribute to the “predicted” scores sent to universities. As a result, I’ve had less time to work on articles for this blog (even though I’ve had a ton more cool ideas).

I recently answered a question on Quora asking about the purpose of bias units in an Artificial Neural Network, which I’m sure some readers of AYOAI may also be curious about. Hence, I’ve posted it here as well; sorry if it’s a bit basic!

After I complete my exams, we’re going to be looking at Recurrent Neural Networks, Convolutional Neural Networks, Markov Chains, Independent Component Analysis/Principal Component Analysis, and more. We’re also going to explore them in the context of recent stories in AI such as Go and Tay.ai! Then, we’ll look at further case studies, mostly from recent research papers such as DeepMind’s Neural Turing Machines (a differentiable Turing machine that can write programs!) and bush fire recognition from satellite imagery.

In a typical artificial neural network, each neuron/activity in one “layer” is connected, via a weight, to each neuron/activity in the next layer. Each of these activities stores some sort of computation, normally a composite of the weighted activities in the previous layer.

A bias unit is an “extra” neuron added to each pre-output layer that stores the value of 1. Bias units aren’t connected to any previous layer and in this sense don’t represent a true “activity”.

Take a look at the following illustration:

http://ufldl.stanford.edu/wiki/images/thumb/4/40/Network3322.png/500px-Network3322.png

The bias units are indicated by the text “+1”. As you can see, a bias unit is simply appended to the input layer and to each hidden layer, and isn’t influenced by the values in the previous layer. In other words, these neurons have no incoming connections.

So why do we have bias units? Well, bias units still have outgoing connections and they can contribute to the output of the ANN. Let’s call the outgoing weights of the bias units w_b. Now, let’s look at a really simple neural network that just has one input and one connection:

Let’s say act(), our activation function, is just f(x) = x, the identity function. In that case, our ANN would represent a line, because the output is just the weight (m) times the input (x).

http://www.mathsteacher.com.au/year10/ch03_linear_graphs/03_equation/Image3673.gif
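
To make this concrete, here is a minimal Python sketch (the names act, tiny_network, and w1 are just illustrative) of that one-input, one-connection network with the identity activation. No matter what weight we pick, the output at input 0 is always 0, so the line always passes through the origin:

```python
def act(x):
    # identity activation: f(x) = x
    return x

def tiny_network(x, w1):
    # one input, one connection, no bias unit: output = act(w1 * x)
    return act(w1 * x)

print(tiny_network(0.0, w1=2.5))  # 0.0 -- the line always passes through the origin
print(tiny_network(1.0, w1=2.5))  # 2.5 -- changing w1 only changes the slope
```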

When we change our weight w1, we change the gradient of the function, making it steeper or flatter. But what about shifting the function vertically? In other words, what about setting the y-intercept? This is crucial for many modelling problems! Our optimal models may not pass through the origin.

So, we know that our function output = w · input (y = mx) needs a constant term added to it. In other words, we can say output = w · input + w_b, where w_b plays the role of the constant term c. When we use neural networks, though, or do any multi-variable learning, our computations are done through linear algebra and matrix arithmetic, e.g. dot products and matrix multiplication. This can also be seen graphically in the ANN: there should be a matching number of weights and activities for a weighted sum to occur. Because of this, we need to “add” an extra input so that the constant term has something to be multiplied against. Since one multiplied by any value is just that value, we simply “insert” an extra value of 1 into every layer. This is called the bias unit.

http://natekohl.net/media/bias-net.gif

From this diagram, you can see that we’ve now added the bias term, and hence its weight w_b is added to the weighted sum and fed through the activation function as a constant value. This constant term, also called the “intercept term” (as demonstrated by the linear example), shifts the activation function to the left or to the right. It also determines the output when the input is zero.
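
Here is a rough sketch of that dot-product trick using NumPy (the variable names are mine, not taken from any particular library): appending the bias unit’s constant 1 to the input vector, and w_b to the weight vector, makes the constant term fall out of the same weighted sum as everything else.

```python
import numpy as np

x = np.array([0.7, -1.2])   # ordinary inputs to a layer
w = np.array([0.5, 0.3])    # weights on those inputs
w_b = 0.8                   # weight on the bias unit (the constant term)

# Append the bias unit's value of 1 to the inputs and w_b to the weights,
# so the constant term is produced by the same dot product.
x_with_bias = np.append(x, 1.0)
w_with_bias = np.append(w, w_b)

weighted_sum = np.dot(w_with_bias, x_with_bias)
assert np.isclose(weighted_sum, np.dot(w, x) + w_b)
print(weighted_sum)  # 0.7*0.5 + (-1.2)*0.3 + 1*0.8 = 0.79
```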

Here is a diagram of how different weights transform the activation function (a sigmoid in this case) by changing its steepness:

http://natekohl.net/media/sigmoid-scale.png

But now, by adding the bias unit, we gain the possibility of translating (shifting) the activation function:

http://natekohl.net/media/sigmoid-shift.png
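
As a small illustrative sketch of both effects (NumPy again; the specific weights are just examples, not the ones used in the diagrams above): the input weight controls the steepness of the sigmoid, while the bias weight w_b slides it left or right along the x-axis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-5, 5, 5)

# Changing the input weight changes the steepness of the curve.
for w in (0.5, 1.0, 2.0):
    print("w =", w, "->", sigmoid(w * x).round(3))

# Changing the bias weight shifts the curve left or right.
for w_b in (-2.0, 0.0, 2.0):
    print("w_b =", w_b, "->", sigmoid(1.0 * x + w_b).round(3))
```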

Going back to the linear regression example: since the bias unit always stores the value 1, we add bias · w_b = 1 · w_b = w_b to the weighted sum fed into the activation function. In the example with the line, this lets us create a non-zero y-intercept:

http://prep.math.lsa.umich.edu/pmc/images/14.1.r.1.gif
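
A quick numeric check of that point (with made-up numbers): using the identity activation again, the output at input 0 is exactly w_b, so w_b is the line’s y-intercept.

```python
def line_network(x, w1, w_b):
    # identity activation: output = w1 * x + 1 * w_b  (the 1 is the bias unit)
    return w1 * x + 1.0 * w_b

print(line_network(0.0, w1=2.0, w_b=1.5))  # 1.5 -- the y-intercept equals w_b
print(line_network(1.0, w1=2.0, w_b=1.5))  # 3.5
```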

I’m sure you can imagine infinite scenarios where the line of best fit does not go through the origin, or even come near it. Bias units are important to neural networks in the same way.
