In this post I will explain how a standard neural network learns; more specifically, how its weights are adjusted so that the network can be trained to perform a task. If you are familiar with the jargon, I am referring specifically to the standard kind of deep learning called supervised learning.

Supervised learning: the use of labelled pairs of inputs and outputs to train a neural network. For example, if the function of the network is to convert paintings to photographs, a training data pair would consist of a handmade painting as the input and a photograph of the exact same scene as the paired output. With a large number of these pairings (collectively called a training dataset), the network learns how to perform the task in general and can handle new paintings it's never seen before.

To explain this concept, we will return to the same diagram from the previous post. When referring to a variable that iterates over the units in a layer, we will say the ith unit in the input layer, the jth unit in the hidden layer, and the kth unit in the output layer.

How do they work?

The neural network trains (learns) by performing three steps and repeating:

1. Forward propagation
2. Calculation of the loss
3. Backpropagation
Over many iterations, the network learns and becomes better at performing its task.

Forward Propagation

An iteration of training begins by feeding the first training input into the input layer of the neural network. The type of input depends on the network: it could be an image (which is really just a special matrix), a numerical vector, a single number, or, in some special kinds of networks, even text. The standard form of an input is a vector, but for now let's focus on a single numerical input. If you recall from the first fundamentals post, each unit in the network is a mathematical function that computes an output given an input. Each unit in the input layer calculates an output value, which is passed to the next layer. That next layer receives these values as its input; that is to say, the output of the first layer becomes the input for the second layer. Each subsequent layer calculates its values and passes the results on, until they eventually reach the last layer. The output of the last layer is the output of the network. This process of passing values forward to calculate the output of the network is called forward propagation. Even though we have calculated the output of our network for a given training input, how does this help the network learn? For this to become a learning process, we need a way of measuring how well the network has been trained in comparison to theoretically perfect performance. This measurement is done by a loss function.

Loss and Loss Functions

Loss is a mathematical quantity that measures the difference between the expected and actual outputs of a neural network. The equation used to calculate this quantity is called the loss function (AKA objective function). Because networks can be trained to accomplish a variety of tasks, there are a variety of loss functions used to quantify their learning.
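To make forward propagation and the loss concrete, here is a minimal NumPy sketch of a tiny network. The layer sizes, the random weights, the sigmoid unit function, and the squared-error loss are all my own illustrative assumptions for the example, not details from this post.

```python
import numpy as np

def sigmoid(u):
    """Each unit squashes its input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, W1, W2):
    """Forward propagation: each layer's output becomes the next layer's input."""
    z = sigmoid(W1 @ x)   # hidden-layer values
    y = sigmoid(W2 @ z)   # network output
    return z, y

def loss(y_expected, y):
    """Squared-error loss: the gap between expected and actual outputs."""
    return np.sum((y_expected - y) ** 2)

# A toy network: 3 inputs, 4 hidden units, 2 outputs, random (untrained) weights
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input -> hidden weights
W2 = rng.normal(size=(2, 4))   # hidden -> output weights

x = np.array([0.5, -1.0, 2.0])     # one training input
y_expected = np.array([1.0, 0.0])  # its labelled (expected) output
z, y = forward(x, W1, W2)
print(loss(y_expected, y))         # nonzero loss: the untrained network is inaccurate
```

Because the weights are random, the output is far from the labelled answer and the loss is large; training is the process of shrinking this number.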
Common examples of loss functions are the Mean Squared Error and the Mean Absolute Error, which are different ways of comparing the expected and actual outputs. There are other loss functions for networks whose task is not to produce numerical output per se but, for example, to classify images into some number of categories (a classic example is cat vs. non-cat). The most common of these is called Cross Entropy Loss. Since this post is not meant to emphasize these mathematical details, I will leave them for a later post.

Recap So Far
Backpropagation

Remember that after the first forward propagation, the output of the network will very likely be inaccurate, resulting in a high loss: the network is still useless. We need a strategy for changing the weights in the network so that the loss systematically decreases. The challenge is that there can be many weights connected in a complex fashion, making it hard to predict how changing any one weight will affect the loss of the network. We need to determine the relationship between the weights and the loss so that we know whether to increase or decrease a given weight. Mathematically, this relationship is represented by the derivative of the loss (E) with respect to the matrix containing all the weights (W):

$$\frac{dE}{dW}\tag{1}$$

To develop this strategy, we will need to define another variable, delta ($\delta$). The delta of a unit tells you how the loss will change based on increases or decreases to the weights connected to that unit. Quantitatively, delta is the derivative of the error with respect to the input to that unit:

$$\delta_k = \frac{dE}{du_k}\tag{2}$$

Where $u_k$ is the variable representing the input to unit k (see appendix 1 for a formal definition). For a more complete explanation of where delta comes from, see appendix 2. Defining delta is helpful because we can use it to determine the relationship between the loss and the specific weight connecting two units ($w_{jk}$). This relationship is given by the derivative of the loss with respect to the weight:

$$\frac{dE}{dw_{jk}} = \delta_k \cdot (\text{value passed from unit } j)\tag{3}$$

Where $w_{jk}$ is the single weight between hidden unit j and output unit k. Note that this can be generalized to any unit k and any unit j in the previous layer. Calculating the derivative $\frac{dE}{dw_{jk}}$ only requires the value passed forward from unit j (which is determined during forward propagation) and the delta.
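To make equation (3) concrete, here is a small sketch that computes $\frac{dE}{dw_{jk}}$ for a single sigmoid output unit with a squared-error loss, then checks the analytic value $\delta_k \cdot z_j$ against a finite-difference estimate. The specific numbers ($z_j = 0.8$, $w_{jk} = 0.5$) are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# One hidden unit j feeding one output unit k through the weight w_jk
z_j = 0.8          # value passed forward from unit j (stored in forward propagation)
w_jk = 0.5
y_expected = 1.0

def loss(w):
    u_k = w * z_j          # input to unit k
    y = sigmoid(u_k)       # output of unit k
    return (y_expected - y) ** 2

# delta_k = dE/du_k for a squared-error loss with a sigmoid output unit
u_k = w_jk * z_j
y = sigmoid(u_k)
delta_k = -2 * (y_expected - y) * y * (1 - y)

analytic = delta_k * z_j                                   # equation (3)
numeric = (loss(w_jk + 1e-6) - loss(w_jk - 1e-6)) / 2e-6   # finite difference
print(analytic, numeric)   # the two values agree closely
```

The agreement between the analytic and numeric derivatives is the whole point of backpropagation: delta lets us get this derivative without perturbing each weight one at a time.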
This derivative tells us how to change the weights to achieve a decrease in loss! We can keep repeating this calculation, adjusting the weights bit by bit, until we achieve the lowest loss possible. The delta term can be calculated directly for each unit in the final output layer. We then pass the deltas back from the output layer towards the input, which allows us to calculate the deltas for the adjacent layer. We continue this process one layer at a time, each layer passing its deltas to the next layer (towards the input), until we have found the delta term for every unit in the entire network! We call this backpropagation because it mirrors forward propagation: instead of passing each unit's output value "forward" to the next layer, we pass the delta terms "backwards" to the previous layer. For those of you who are so inclined, see appendices 3 and 4 for an explanation of why backpropagation works this way. If you are not interested in the math, you can take my word for it; it won't affect your ability to understand my future deep learning posts. With all the delta values, we can finally solve equation (3) and find how each weight should be changed to minimize the loss! Here is a diagram (my apologies for the poor drawing) showing how the deltas are first calculated in the output layer and backpropagated through the network. For the sake of clarity, only one unit per layer is shown passing deltas. Here is an expanded recap of what we covered in this post:
Conclusion

I apologize for this post being so long and complicated, but these concepts form a cycle and I do not think they can be properly understood in isolation. Thankfully, we have now covered the meat and potatoes of deep learning; the rest is just some cool details that some very smart people have figured out over the years to achieve faster training times, better accuracy, and so on. In the next post, I plan to cover step (5) from the recap above in more depth, using a concept called gradient descent. For now, try to let that all soak in!

Thanks for reading,
Rick Sugden

Appendix 1: explaining $u_k$

To explain what $u_k$ means, recall that the input to a given unit is the weighted sum of all the values passed to it. The equation below relates the input to unit k ($u_k$) to the values passed from the N units in the previous layer (z), using the weights connecting each unit in the previous layer to our unit k ($w_{nk}$):

$$u_k = \sum_{n=1}^{N} z_n \cdot w_{nk}\tag{1.1}$$

Appendix 2: defining the variable delta

For this example, I'll discuss the effect of a single weight between a hidden unit j and an output unit k on that single output unit's loss. This is generalizable because you could repeat the process for each output unit's loss and each weight in the network. Recall that we want to determine the derivative of the error with respect to a weight:

$$\frac{dE_{k}}{dw_{jk}}$$

Where the error is a function of the input ($u_k$), and the input is in turn a function of the weights (w) feeding into unit k.
$$E_k = f(u_{k}(w_{jk}, w_{j+1,k}, \ldots))\tag{2.1}$$

We can apply the chain rule to this derivative:

$$\frac{dE_k}{dw_{jk}} = \frac{dE_k}{du_{k}} \cdot \frac{du_{k}}{dw_{jk}}\tag{2.2}$$

Using the equation for $u_k$, the second derivative is simply the value passed by unit j, which we will call z since it is a hidden unit (it would have been x if we were talking about the input layer):

$$\frac{du_{k}}{dw_{jk}} = z_j\tag{2.3}$$

We now have:

$$\frac{dE_k}{dw_{jk}} = \frac{dE_k}{du_{k}} \cdot z_j\tag{2.4}$$

So we can define delta ($\delta$) as:

$$\delta_k = \frac{dE_k}{du_{k}}\tag{2.5}$$

Such that our equation is:

$$\frac{dE_k}{dw_{jk}} = \delta_k \cdot z_j\tag{2.6}$$

To recap, $z_j$ is the value passed by unit j, which is calculated and stored during forward propagation. If we can determine delta, we have solved our problem of relating the weights to the loss. This equation can be generalized: k could be any unit and j any unit that passes values to k.

Appendix 3: determining delta of an output unit

The loss of output unit k is a function of the expected output (from the training data) and the actual output y, which is itself a function of $u_k$. For a squared-error loss:

$$E_k = (y_{expected} - y(u_k))^2\tag{3.1}$$

Noting that

$$\delta_k = \frac{dE_k}{du_{k}}\tag{3.2}$$

$\delta_k$ is easily solvable, given that $y(u)$ must always be chosen to be a differentiable function.

Appendix 4: determining delta of hidden units

We are now looking to find the delta for a hidden-layer unit j. Again, we will discuss a single generalizable weight between unit j and some unit i from the previous layer (in this case, the input layer). But because a hidden unit affects multiple output units, we will use the total loss E rather than $E_k$.
This means we are interested in a slightly different derivative: we want to relate the loss to the weight connecting units i and j:

$$\frac{dE}{dw_{ij}}$$

By repeating the process in appendix 2, we get:

$$\frac{dE}{dw_{ij}} = \delta_j \cdot x_{i}\tag{4.1}$$

Where x is a value passed from the input layer (since unit j is now in the hidden layer), and $\delta_j$ is now:

$$\delta_j = \frac{dE}{du_j}\tag{4.2}$$

This can be expanded by the chain rule:

$$\delta_j = \frac{dE}{dz_{j}} \cdot \frac{dz_{j}}{du_j}\tag{4.3}$$

The term on the far right is easily obtainable once we realize that $z(u)$ is just the sigmoid function, so we can look up its derivative in a table:

$$\frac{dz_j}{du_j} = z_j \cdot (1 - z_j)\tag{4.4}$$

To solve for $\frac{dE}{dz_j}$, we need to acknowledge that each hidden unit z affects all K output units:

$$\frac{dE}{dz_j} = \sum_{k=1}^{K} \frac{dE_k}{dz_j}\tag{4.5}$$

$$\frac{dE}{dz_j} = \sum_{k=1}^{K} \frac{dE_k}{du_k} \cdot \frac{du_k}{dz_j}\tag{4.6}$$

We recognize the first term as $\delta_k$!

$$\delta_k = \frac{dE_k}{du_k}\tag{4.7}$$

From equation (1.1), it is trivial to find that:

$$\frac{du_k}{dz_j} = w_{jk}\tag{4.8}$$

Therefore:

$$\frac{dE}{dz_j} = \sum_{k=1}^{K} \delta_k \cdot w_{jk}\tag{4.9}$$

Since the weights $w_{jk}$ are already known, we can simply state that the relationship between a weight and the loss in one layer depends on the deltas of the next layer!

$$\frac{dE}{dw_{ij}} \propto \sum_{k=1}^{K} \delta_k \cdot w_{jk}\tag{4.10}$$
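The appendix equations can be checked numerically. The sketch below uses a toy network with my own arbitrary sizes and random weights, assuming sigmoid units and a squared-error loss as in the appendices: it computes the hidden-layer deltas via equations (4.3)-(4.9) and verifies equation (4.1) against a finite-difference estimate of $\frac{dE}{dw_{ij}}$.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
x = rng.normal(size=3)           # input-layer values x_i
W1 = rng.normal(size=(4, 3))     # weights w_ij (input -> hidden)
W2 = rng.normal(size=(2, 4))     # weights w_jk (hidden -> output)
y_expected = np.array([1.0, 0.0])

def total_loss(W1):
    z = sigmoid(W1 @ x)
    y = sigmoid(W2 @ z)
    return np.sum((y_expected - y) ** 2)

# Forward pass, storing the intermediate values
z = sigmoid(W1 @ x)
y = sigmoid(W2 @ z)

# Output-layer deltas (appendix 3): dE/du_k for squared error + sigmoid
delta_k = -2 * (y_expected - y) * y * (1 - y)

# Hidden-layer deltas (equations 4.3-4.9): pass the deltas back through W2
delta_j = (W2.T @ delta_k) * z * (1 - z)

# Equation 4.1: dE/dw_ij = delta_j * x_i, checked by a finite difference
analytic = np.outer(delta_j, x)
i, j = 0, 1                      # pick one weight w_ij to perturb
eps = 1e-6
W1p = W1.copy(); W1p[j, i] += eps
W1m = W1.copy(); W1m[j, i] -= eps
numeric = (total_loss(W1p) - total_loss(W1m)) / (2 * eps)
print(analytic[j, i], numeric)   # the two values agree closely
```

Note how `delta_j` is computed entirely from the next layer's deltas and weights; this is exactly the backwards pass described in the post.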
