# Kawaii LSTMs

###### Take the link. But it’s… it’s not like I like you or anything. Baka!

A slope is the change in y over the change in x. In calculus, we discover that the slope at a point on a function is the slope of the tangent line at that point.

If you know the inclination of the slope, you know if you are walking up a hill or down a hill, even if the terrain is covered in fog.

The higher the value of the function, the more error it represents. The lower, the less error.

We wish to know the slope so that we can reduce error. The parameters are what cause the error function to slither up and down.

If we can’t feel the slope, we don’t know if we should step to the “right” or to the “left” to reduce the error.

So are we done once the slope is flat? Not necessarily. The problem is that we can end up on the tip-top of a hill and also have a flat slope. We want to end up at a minimum. This means that we must follow a procedure: if the slope is negative, move right. If the slope is positive, move left. Never climb, always slide.
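The "never climb, always slide" procedure can be sketched in a few lines of Python. This is only a minimal illustration: the `error` curve, the step size, and the helper names `slope` and `slide_downhill` are made up for this example, and the slope is estimated numerically rather than derived by hand.

```python
def slope(f, x, h=1e-6):
    """Estimate the slope of f at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


def slide_downhill(f, x, step=0.1, iterations=200):
    """Repeatedly step against the slope.

    Negative slope -> x increases (move right).
    Positive slope -> x decreases (move left).
    """
    for _ in range(iterations):
        x = x - step * slope(f, x)
    return x


def error(x):
    # Hypothetical error curve (an assumption for this sketch),
    # shaped like a valley with its bottom at x = 3.
    return (x - 3) ** 2 + 1


x_min = slide_downhill(error, x=-4.0)
print(f"ended up near x = {x_min:.3f}")
```

Note that if we started exactly on a hilltop, the slope would be zero and this loop would never move, which is the flat-slope trap described above.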

There is a similar procedural mission going on in a neural network except that the sense of error comes from a higher-dimensional slope called a gradient.

No, it's very similar. The gradient tells you the direction of steepest ascent in a multidimensional terrain. Then, you must step in the direction of the negative gradient.

Oh, I didn’t mention that the terrain was multidimensional? Well, it is. Unlike the function I initially showed you, there is not just a single place where the input goes.

This means that not only is there fog but that the hills and valleys are beyond human comprehension. We can’t visualize them even if we tried. But like the sense-of-error from slope which guides us down a human-world hill, the gradient guides us down to the bottom in multi-dimensional space.
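Here is the same downhill slide with more than one input, as a rough sketch. The three-parameter `error` bowl is a stand-in chosen for illustration (real networks have far more parameters and far messier terrain), and `gradient` and `descend` are hypothetical helper names. The gradient is just one slope per coordinate, collected into a list.

```python
def gradient(f, w, h=1e-6):
    """Estimate the gradient of f at w: one slope per coordinate."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += h
        down = list(w)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def descend(f, w, step=0.1, iterations=500):
    """Step towards the negative gradient, over and over."""
    for _ in range(iterations):
        g = gradient(f, w)
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w


def error(w):
    # Hypothetical bowl-shaped error (an assumption for this sketch)
    # with its bottom at (1, -2, 0.5).
    return (w[0] - 1) ** 2 + (w[1] + 2) ** 2 + (w[2] - 0.5) ** 2


w_min = descend(error, [5.0, 5.0, 5.0])
print([round(wi, 3) for wi in w_min])
```

We never need to picture the landscape. At each step the gradient alone says which way is down.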

The neural network is composed of layers. Each layer has a landscape to it, and hence its own set of parameters w with its own gradient.

Here is the objective function Q(w) for a single layer:

The goal is to plug in a lucky set of parameters w1 on the neurons of the first layer, a lucky set of parameters w2 on the neurons of the second layer, and so on, with the intention of minimizing the function.

We don’t just guess randomly each time. We slide towards the better w based on our sense of error from the gradient. The gradient is revealed at the final layer’s output.

However, we are initially dropped randomly in the function. Our first layer’s w has to be random.

This presents a huge problem. Although all we have to do is calculate the gradient of the error with respect to the parameters w, the weights closer to the end of the network tend to change a lot more than those at the beginning. If the initial w randomly lands somewhere that gives the early layers lethargic weight updates, the whole network will barely move.
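To get a feel for how lethargic the early weights can be, here is a toy sketch. It assumes a chain of five layers with one sigmoid neuron each, every weight set to 0.5, and a squared error at the output. None of these choices come from a real network; they are just the smallest setup that shows the effect. The function name `weight_gradients` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weight_gradients(weights, x, target):
    """Gradient of 0.5 * (output - target)**2 w.r.t. each layer's weight."""
    # Forward pass: remember the activation coming out of every layer.
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # Backward pass: walk from the last layer to the first (chain rule).
    grad_a = activations[-1] - target      # d(error) / d(final output)
    grads = [0.0] * len(weights)
    for k in reversed(range(len(weights))):
        a_out = activations[k + 1]
        grad_z = grad_a * a_out * (1.0 - a_out)   # through the sigmoid
        grads[k] = grad_z * activations[k]        # d(error) / d(weight k)
        grad_a = grad_z * weights[k]              # hand back to earlier layer
    return grads


grads = weight_gradients([0.5] * 5, x=1.0, target=0.0)
for layer, g in enumerate(grads, start=1):
    print(f"layer {layer}: |gradient| = {abs(g):.2e}")
```

In this setup each step backwards multiplies the signal by a weight and a sigmoid slope (which is never more than 0.25), so the first layer's gradient comes out orders of magnitude smaller than the last layer's.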

##### By the way, weights are a subset of the parameters. Think of each weight/parameter update as an almost magical multidimensional step in the stroll through the landscape, with every single step determined by the gradient.

The first guide in our multidimensional landscape may happen to have a broken leg, so he cannot explore his environment very well. Yet guide number two and guide number three must receive directions from him. This means that they will also be slower at finding the bottom of the valley.

LSTMs solve this by knowing how to remember. So now let’s look inside an LSTM.

###### Recurrent neural networks are intimately related to sequences and lists. Some RNNs are composed of LSTM units.

Stay tuned for the explanation of what is going on in there!