# Kawaii LSTMs

###### Take the link. But it’s… it’s not like I like you or anything. Baka!

Slopes are changes in y over changes in x. In calculus, we discover that the slope at a point on a function is the slope of the tangent line at that point. If you know the inclination of the slope, you know if you are walking up a hill or down a hill, even if the terrain is covered in fog.

The higher the value of the function, the more error it represents. The lower, the less error.

We wish to know the slope so that we can reduce error. What causes the error function to slither up and down in its error are the parameters.

If we can’t feel the slope, we don’t know if we should step to the “right” or to the “left” to reduce the error.

Not necessarily. The problem is that we can end up on the tip-top of a hill and also have a flat slope. We want to end up at a minimum. This means that we must follow a procedure: If negative slope, then move right. And if positive slope, move left. Never climb, always slide.
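The “never climb, always slide” procedure can be sketched in a few lines. This is only an illustration under my own assumptions: the error function, the step size, and the names slope and slide_downhill are all made up for the example and are not part of the script discussed later.

```python
def slope(f, x, h=1e-6):
    """Estimate the slope of f at x by peeking a tiny distance to each side."""
    return (f(x + h) - f(x - h)) / (2 * h)


def slide_downhill(f, x, step_size=0.1, n_steps=100):
    """Negative slope -> move right, positive slope -> move left."""
    for _ in range(n_steps):
        x = x - step_size * slope(f, x)
    return x


def error(x):
    # A made-up error function with a single valley at x = 3.
    return (x - 3.0) ** 2 + 1.0


x_min = slide_downhill(error, x=-5.0)
print(x_min)
```

Subtracting the slope is what encodes both rules at once: a negative slope pushes x to the right, a positive slope pushes it to the left.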

There is a similar procedural mission going on in a neural network except that the sense of error comes from a higher-dimensional slope called a gradient.

No, it’s very similar. The gradient tells you the direction of steepest ascent in a multidimensional terrain. Then, you must step towards the negative gradient.

Oh, I didn’t mention that the terrain was multidimensional? Well it is. There is not a single place where the input goes like in the function I initially showed you.

This means that not only is there fog but that the hills and valleys are beyond human comprehension. We can’t visualize them even if we tried. But like the sense-of-error from slope which guides us down a human-world hill, the gradient guides us down to the bottom in multi-dimensional space.

The neural network is composed of layers. Each layer has a landscape to it, and hence its own set w of parameters with its own gradient.

Here is the objective function Q(w) for a single layer:

The goal is to plug in a lucky set of parameters w1 on the neurons of the first layer, the lucky set of parameters w2 on the neurons of the second layer, and so on with the intention of minimizing the function.

We don’t just guess randomly each time, we slide towards the better w based on our sense of error from the gradient. The gradient is revealed at the final layer’s output.

However, we are initially dropped randomly in the function. Our first layer’s w has to be random.

This presents a huge problem. Although all we have to do is calculate the gradient of the error with respect to the parameters w, the weights closer to the end of the network tend to change a lot more than those at the beginning. If the initial w randomly falls on [having the trait of lethargic weight updates], the whole network will barely move.

##### By the way, weights are a subset of the parameters. Think of each weight/parameter update as an almost magical multidimensional-step in the stroll through the landscape; with every single step determined by the gradient.

The first guide in our multidimensional landscape may happen to have a broken leg, so he cannot explore his environment very well. Yet guide number two and guide number three must receive directions from him. This means that they will also be slower at finding the bottom of the valley.
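Here is one way to see the broken leg in numbers. This sketch assumes sigmoid activations and weights of exactly 1, which is my simplification and not something the network above is required to use. The point is only that the error signal gets multiplied by a small number once per layer on its way back to the start.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    """Slope of the sigmoid; it is never larger than 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)


def signal_reaching_layer(n_layers_back, z=0.0, weight=1.0):
    """Chain rule: multiply one slope (times one weight) per layer crossed."""
    signal = 1.0
    for _ in range(n_layers_back):
        signal *= weight * sigmoid_slope(z)
    return signal


for depth in (1, 3, 10):
    print(depth, signal_reaching_layer(depth))
```

The further back a layer sits, the fainter the directions it receives, so its weights barely move.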

LSTMs solve this by knowing how to remember. So now let’s look inside an LSTM.

###### Recurrent neural networks are intimately related to sequences and lists. Some RNNs are composed of LSTM units.

Stay tuned for the explanation of what is going on in there!

# How to Create a Custom Sacred Text with Artificial Intelligence

Okay, let’s create a new religion using the power of neural networks. That’s my definition of a night well spent.

I will feed it Neon Genesis Evangelion, some of the Buddhist Suttas, Wikipedia articles about cosmology, and text from futuretimeline.com, and see what kind of deep-sounding fuckery it comes up with.

To do it yourself, first install Python and Keras and a backend (Theano or TensorFlow). Make sure you install the backend first, then Keras. Also make sure that the version of Python that starts when you type python in terminal is the same version that Keras is installed under.

To find out where Keras is installed, pip install keras. There should be a version of Python that it mentions. You don’t want Python in terminal to be 2.6 and Keras to be on Python 3.6. If this is the case, type python3 instead of python.

If you are pasting each line into terminal, watch for the ‘>>>’ and ‘...’ prompts. If there is an indentation in the script, you should tab after ‘...’. If there is no longer indentation, you must press enter to get out of ‘...’ so that ‘>>>’ shows up again.

The better, less tedious way to run it is to save the script as a .py file using the Python Shell. Once you save it, paste this on top of the code: #!/usr/bin/env python

Go to terminal and enter chmod +x my_python_script.py, replacing my_python_script.py with the entire path to your file, such as /Users/mariomontano/Documents/sacredtext.py  You should find the path on top of the window when you create and save a new file on Python Shell.

Then type python3 /Users/mariomontano/Documents/sacredtext.py into terminal to run it.

I’m going to explain the code to reduce the unease.

from __future__ import print_function

from __future__ import print_function is the first line of code in the script. This commits us to having to use print as a function now. A function is a block of code that is used to perform a single action.

The whole point of from __future__ import print_function is to bring the print function from Python 3 into Python 2.6+ just in case you’re not using Python 3. If you are using Python 3, don’t worry about it.

from keras.callbacks import LambdaCallback

So there is a training procedure we have to set off, but we’re going to want to view the internal states and statistics of the model during training.

This particular callback allows us to create a custom callback that reports at a certain time. In our case, we want it to reveal some info at an arbitrary cutoff used to separate training into distinct phases, which is useful for logging and periodic evaluation. We call this arbitrary cutoff an epoch. So at the end of an epoch, it will report some stuff we set it up to report.

from keras.models import Sequential

This time, we are choosing the kind of neural network – the model. There are two kinds of models in Keras: Sequential and Functional API.  Basically, you use the Sequential Model if you want to keep things simple, and you use the Functional API to custom design more complex models, which include non-sequential connections and multiple inputs/outputs. We want to keep things simple.

from keras.layers import Dense, Activation

Here, we are bringing two important things to the table: Dense() will allow us to summon layers with a chosen number of neurons, and Activation() is for choosing a function that is applied to a layer of neurons. By tweaking the kind of activation function and number of neurons, you can make the model better or worse at what it does.

from keras.layers import LSTM

An LSTM is a type of recurrent neural network that allows information to be remembered. We don’t want it to forget the earlier characters of a sequence by the time it reaches the later ones.

from keras.optimizers import RMSprop

An optimizer is one of the two arguments required for compiling a Keras model. RMSprop is an optimizer which is usually a good choice for recurrent neural networks.

from keras.utils.data_utils import get_file

import numpy as np

This allows us to use numpy for example as in np.array([1,2,3]) instead of numpy.array([1,2,3]).

import random

random will allow us to generate random integers. This will be important down the line. Remember that first we are equipping ourselves.

import sys

sys is a module which provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. One such function is sys.stdout.write, which is more universal if you ever need to write dual-version code (e.g. code that works simultaneously with Python 2.x as well as Python 3.x). Combined with sys.stdout.flush, it will later allow us to see output even before the script completes. If we didn’t use these, then we would see the output printed all at once to the screen, at the end. This is because the output is being buffered, and unless we flush sys.stdout with each print, we won’t see the output immediately.

import io

This gives us io.open, which opens a file so that we can read the text inside it: it opens the cereal box for our hungry machine. As in: io.open( , ). On the left side of the comma goes the path, and on the right goes the character encoding. The path is a location on your own computer (get_file, imported above, is what downloads the text from a web page and hands back that local path), and the character encoding will be utf-8.

path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')

Just replace

path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')

with a path to your own text or combination of texts.

If you don’t want it hosted on a site and are on a Mac, you can store a file as .txt then find it by:

right clicking on file in finder -> Get Info -> copy the stuff in front of Where: + the file name with .txt at the end -> path = “____ ”

with io.open(path, encoding='utf-8') as f:

The line above opens the file at path (our local copy of ‘nietzsche.txt’) and decodes it as ‘utf-8’. If you don’t pass in any encoding, a system-specific default will be picked, and that default may not be able to express all characters (this can happen on Python 2.x and/or Windows).

We do as f: so that we can then easily f.read().lower() instead of io.open(path, encoding='utf-8').read().lower(). When we do this, f is called a file object. The result is what the rest of the script calls text, as in text = f.read().lower().

We read().lower() so that the string comes out in lower case.

print('corpus length:', len(text))

This will output the statement ‘corpus length:’ and the number of characters in the entire string of text. Remember that a string is a linear sequence of characters.

chars = sorted(list(set(text)))

sorted(list(...)) does this: it puts the characters into a list in sorted order, so each character lands in the same position every time the script runs.

set(text) makes sure that each character only exists once.

For example: ‘The dog went to the pound after eating a pound of dog.’ would become [the, dog, went, to, pound, after, eating, a, of] if each character was a word. But in our case, each character is a letter/number/special.

So just think of that example but with individual letters. Out of a large corpus, you would probably get out the entire alphabet, numbers, and special characters.

print('total chars:', len(chars))

This will give total chars: 57, for example. It gives you the number of distinct characters, after eliminating all repeated characters. Compare print('corpus length:', len(text)), which gives you the total number of characters in the whole text.
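To see both counts side by side, here is a small sketch that runs the same two steps on the dog sentence from above instead of a full corpus (the sentence is only a stand-in for your own text):

```python
text = 'the dog went to the pound after eating a pound of dog.'

# set() drops repeats, sorted() fixes the order.
chars = sorted(list(set(text)))

print('corpus length:', len(text))
print('total chars:', len(chars))
print(chars)
```

The space and the period count as characters too, which is why they show up in chars alongside the letters.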

char_indices = dict((c, i) for i, c in enumerate(chars))

enumerate(chars) will assign a number to each character. The numbers start at 0 and climb up, 1,2,3,4… for each distinct character in chars.

dict() pairs each character (the key) with its assigned number (its index, the value).

indices_char = dict((i, c) for i, c in enumerate(chars))

This may seem a bit redundant, but this reverse mapping ensures that a particular variable (in this case indices_char) stores the numerical indices mapped back to their characters. This is so that we can convert the integers back to characters once we start getting integer predictions later on.

In other words, what we did with these two lines of code is create a dictionary that maps each character to a number and vice versa.

i is often referred to as the id of the char.
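A quick round trip shows why we want both dictionaries. The tiny text 'hello world' and the names encoded and decoded are made up for this sketch; the two dict lines are the ones from the script.

```python
chars = sorted(set('hello world'))

char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# Characters -> ids, the direction we use when feeding the network.
encoded = [char_indices[c] for c in 'hello']

# Ids -> characters, the direction we use when reading its predictions.
decoded = ''.join(indices_char[i] for i in encoded)

print(encoded)
print(decoded)
```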

# cut the text in semi-redundant sequences of maxlen characters

When reading code, a hashtag before a set of words means these words are not part of the code. It is a statement by the author(s) about what a section of code is meant to do. Like a hyper-rushed explanation. Sadly, even then, most code in the world is uncommented… But here I am. It’s okay. Mankind may abide in me from now on.

What is meant by cut the text in “semi-redundant sequences” is best explained by looking at what the code itself does.

maxlen = 40

This sets the character count in each chunk to 40.

step = 3

By setting the step equal to 3, we divide the entire dataset into chunks of length 40, where the beginning of each chunk is 3 steps/characters apart.

sentences = []

next_chars = []

These use brackets instead of () because [] are designed to be used for lists and for indexing/lookup/slicing. Plus the inner contents of [] can be changed. This is exactly what we need. In the next lines of code we will fill these two containers.

for i in range(0, len(text) - maxlen, step):

for means the code will be executed repeatedly.

i is the variable name; here it stands for the position in the text where each chunk starts

range() returns a sequence of numbers

range(Starting number of the sequence, Generate numbers up to but not including this number, Difference between each number in the sequence)

so if we have range(0,50,3), it will return [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48]

sentences.append(text[i: i + maxlen])

.append() adds an item to the end of a list, the list being sentences in our case. The item here is the slice of text that starts at the current index i and consists of 40 characters. i: i + maxlen means “from i up to, but not including, i + maxlen”. We are filling sentences with chunks of 40 characters.

Taking the last two lines of code I explained together, we are filling the sentences with 40 characters every 3 characters.

next_chars.append(text[i + maxlen])

Now for next_chars, we fill it with only the next character after that. Notice it doesn’t say text[i: i + maxlen]. We are filling it with a single character.

next_chars is the single next character following after the collection of 40.

So next_chars will be filled with the single next character following after the collection of 40 characters every 3 characters within the specified range.

print('nb sequences:', len(sentences))

This will output the number of sentences created by 3-stepping, which should be roughly one third of the corpus length.
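Here is the whole cutting loop with its indentation, run on the alphabet so you can see what “semi-redundant” means. I shortened maxlen to 10 purely so the output fits on a screen; the script itself uses 40.

```python
text = 'abcdefghijklmnopqrstuvwxyz'
maxlen = 10  # shortened for the demo; the script uses 40
step = 3

sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])   # a chunk of maxlen characters
    next_chars.append(text[i + maxlen])     # the one character right after it

for sentence, next_char in zip(sentences, next_chars):
    print(sentence, '->', next_char)
print('nb sequences:', len(sentences))
```

Each chunk shares most of its characters with the chunk before it, shifted over by 3. That overlap is the redundancy.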

print('Vectorization...')

x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)

y = np.zeros((len(sentences), len(chars)), dtype=bool)

np.zeros(shape, dtype) creates a new array of the given shape, filled with zeros. If len(sentences) gives 6, then np.zeros(6) will create [0.,0.,0.,0.,0.,0.]. A second number in the shape turns each of the six zeros into a bracket and specifies how many zeros go within each bracket. With np.zeros((6,2)), we get [[0.,0.],[0.,0.],[0.,0.],[0.,0.],[0.,0.],[0.,0.]]. A third number, as in x, nests the brackets one level deeper. Play around with np.zeros(( , )) to get an intuition for it.

dtype=bool ensures that there are only two options, True or False, 1 or 0. (Older copies of this script say dtype=np.bool, an alias that is missing from some NumPy releases; the plain bool works in all of them.)

What we are doing here is storing our data into vectors.

for i, sentence in enumerate(sentences):

for t, char in enumerate(sentence):

x[i, t, char_indices[char]] = 1

y[i, char_indices[next_chars[i]]] = 1

Remember, the indices i and t stand for which sentence and which position within that sentence, respectively. For each character, char_indices[char] picks one of the len(chars) slots and sets it to 1, leaving the rest at 0. So a character with id 0 becomes [1,0,0,0,0,…,0], a character with id 1 becomes [0,1,0,0,0,…,0], and so on. The same is done in y for the single next character of each sentence.

We now have a 3-dimensional array x (sentence, position, character) and a 2-dimensional array y (sentence, next character).

This is called one-hot encoding.
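Because the indentation decides which line belongs to which loop, here is the vectorization step again as one block, run on a toy text. The text 'hello world' and the small maxlen are my own stand-ins; the loops are the ones from the script.

```python
import numpy as np

text = 'hello world'
chars = sorted(set(text))
char_indices = dict((c, i) for i, c in enumerate(chars))

maxlen, step = 4, 3  # tiny values for the demo
sentences, next_chars = [], []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1      # inner loop: one 1 per position
    y[i, char_indices[next_chars[i]]] = 1    # outer loop: one 1 per sentence

print(x.shape, y.shape)
print(x[0].astype(int))
```

Printing x[0] shows the first sentence as a small grid with exactly one 1 in every row.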

print('Build model...')

model = Sequential()

A Sequential model is a linear stack of layers. You use this simple model in several situations. For example, when you are performing regression, you will usually have a final layer as linear.

You also use it when you want to generate a custom Bible based on anime dialogue, Nick Bostrom’s philosophy, and your own Tennysonian solarpunk fiction.

model = Sequential( ) starts the model, which you can design with custom layers, as you will see in the following lines of code.

When you do model = Sequential(), you can then choose model.add(Dense()) or model.add(LSTM()). These two are the choices we imported from keras.layers way back at the beginning. They are layers: those columns in the picture.

Dense() is considered the regular kind of layer. A linear operation in which every input is connected to every output by a weight. To understand what it actually means, you must go here.

We are using an LSTM layer, so we must specify two things: 1. the amount of neurons in the first hidden layer (which in our case happens to be equal to the batch_size, the number of samples that are going to be propagated through the network); 2. the input_shape, which is specified by maxlen and len(chars) in our case. By saying input_shape=(maxlen, len(chars)) we are essentially telling it “Hey, we will be feeding you 40 characters of 57 kinds (the alphabet plus punctuation, etc.)”

The output dimensionality of the LSTM layer and also the batch_size is 128. Unlike input_shape, this number was not determined based on our data. We specify it by convention because it was probably experimentally tested to be useful across many neural network use-cases. You can change it and possibly receive better results. But be warned that a very big batch size may not fit the memory and takes longer to train.

To clarify and summarize: the batch_size denotes the subset size of our training sample (e.g. 100 out of 1000) which is going to be used in order to train the network during its learning process. Each batch trains the network in a successive order, taking into account the updated weights coming from the previous batch. Here, that number is equal to the neurons in our first hidden layer.

The Dense(len(chars)) layer that follows the LSTM is a linear layer composed of the same amount of neurons as there are distinct characters in the text. For example: 57.

This is our final layer.

Remember that our goal is to minimize the objective function which is parametrized with parameters. We update its parameters by nudging them in the opposite direction of the gradient of the objective function. This way, we take little steps downhill. The goal is to reach the bottom of a valley. The image shows a function with two inputs. Our function’s landscape cannot be visualized by humans because it has way more than two inputs.

In order to minimize the cost function, it is important to have smooth non-linear output.

A neural network without an activation function is essentially just a linear regression model. The activation function does the non-linear transformation to the input making it capable to learn and perform more complex tasks. We would want our neural networks to work on complicated tasks like language translations and image classifications. Linear transformations would never be able to perform such tasks.

Activation functions make the back-propagation possible since the gradients are supplied along with the error to update the weights and biases. Without the differentiable non linear function, this would not be possible.

Activation(‘softmax’) works out the activation of each neuron to range between 0 and 1 by its nature:

This is important for our eventual goal of allowing the network to move to a local minimum by little nudges in the direction of the negative gradient. There are many activation functions, but we are using softmax because it takes an N-dimensional vector as input and turns it into N values between 0 and 1 that add up to 1, which we can read as one probability for each of our N characters.
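If you want to poke at softmax by hand, here is a plain-Python sketch of it. This is the textbook formula rather than Keras’s own code, and the three input scores are invented for the example.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into values between 0 and 1 that sum to 1."""
    biggest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The biggest score gets the biggest share, but every option keeps a non-zero chance.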

optimizer = RMSprop(lr=0.01)

An optimizer is one of the two arguments required for compiling a Keras model.

This optimizer divides the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.

This helps because we don’t want the learning rate to be too big, causing it to slosh to and fro across the minimum we seek.
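For the curious, that division can be sketched on a single weight. Treat this as an illustration of the idea and not as Keras’s implementation: the decay rho=0.9, the small eps, the name rmsprop_step, and the bowl-shaped error w**2 are all my assumptions.

```python
import math


def rmsprop_step(w, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-7):
    """One update: keep a running average of grad**2, then divide by its root."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq


w, avg_sq = 5.0, 0.0
for _ in range(2000):
    grad = 2.0 * w  # slope of the made-up error function w**2
    w, avg_sq = rmsprop_step(w, grad, avg_sq)

print(w)
```

Because the gradient is divided by its own recent size, each step stays on the order of the learning rate whether the slope is steep or gentle.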

model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Once you have defined your model, it needs to be compiled. This creates the efficient structures used by the underlying backend (Theano or TensorFlow) in order to efficiently execute your model during training.

loss= The loss function, also called the objective function, is the evaluation of the model used by the optimizer to navigate the weight space.

Since we are using categorical labels, i.e. one-hot vectors, we want to choose categorical_crossentropy from the loss function options. If we have two classes, they will be represented as 0, 1 in binary labels and as [1, 0], [0, 1] in categorical label format. Our target for each sample is a row of y: a vector that is all-zeros except for a 1 at the index corresponding to the class (the next character) of the sample.
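To get a feel for what this loss rewards, here is the standard formula written out for one sample. The function below is my own plain-Python sketch and the two predictions are invented; Keras computes the same quantity in its own way.

```python
import math


def categorical_crossentropy(target, predicted):
    """Minus the log of the probability given to the correct class."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


target = [0, 1, 0]  # one-hot: the correct class is index 1

good = categorical_crossentropy(target, [0.1, 0.8, 0.1])
bad = categorical_crossentropy(target, [0.7, 0.2, 0.1])
print(good, bad)
```

A confident correct guess gives a small loss and a confident wrong guess gives a large one.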

def sample(preds, temperature=1.0):

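def sample takes the probability outputs of the softmax function and outputs the index of a character drawn at random according to those probabilities. The most probable character is the likeliest pick, but it is not the only possible one.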

The temperature parameter decides how much the differences between the probability weights are weighted. A temperature of 1 is considering each weight “as it is”, a temperature larger than 1 reduces the differences between the weights, a temperature smaller than 1 increases them.

The way it works is by scaling the logits before applying softmax.

# helper function to sample an index from a probability array

As we will see below, def sample takes the probability outputs of the softmax function and draws one character index from them at random, so likelier characters come up more often.

preds = np.asarray(preds).astype('float64')

np.asarray is the same as np.array except it has fewer options, and copy=False

.astype('float64') casts the values to 64-bit floating point, which can represent about 15 to 17 significant decimal digits

preds = np.log(preds) / temperature

np.log(preds) takes the array into the natural log function

/temperature divides the logs by the temperature. The default is 1.0, which would leave them unchanged, but on_epoch_end passes in other values (0.2, 0.5, 1.0, 1.2), so the division matters.

exp_preds = np.exp(preds)

This is part of the common function to sample from a probability vector. It calculates the exponential of all elements in the input array.

preds = exp_preds / np.sum(exp_preds)

np.sum takes the sum of array elements over a given axis. Since we have not specified an axis, the default axis=None, and we will sum all of the elements of the input array.

probas = np.random.multinomial(1, preds, 1)

np.random.multinomial samples from a multinomial distribution. A multinomial is like a binomial distribution but with more than two possible outcomes. With (1,_,_) we specify that only one trial is taking place: a single roll. How many results a trial can have depends on what is being rolled. For example, dice will always yield a number from 1 to 6, while a coin flip has only two outcomes. Our “dice” has one face per character.

(_,preds,_) This middle term expresses the probabilities of the possible outcomes, p.

The (_,_,1) ensures that only 1 array is returned.

The array will return values that represent how many times our metaphorical dice landed on “1, 2, 3, 4, 5, and 6.”

return np.argmax(probas)

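return np.argmax returns the index of the maximum value along an axis. We do not specify an axis here. So by default, the index is from the flattened array of probas. Since we rolled only once, probas holds a single 1 among zeros, and argmax finds the position of that 1: the index of the character we drew.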
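Here are the lines of sample put back together with their indentation, so you can run it on its own. The helper reweight is something I added for illustration (it is not in the script); it repeats the first steps so you can see what temperature does to the probabilities before any dice are rolled. The three probabilities are made up.

```python
import numpy as np


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def reweight(preds, temperature):
    """Illustration only: the probabilities sample() ends up rolling with."""
    preds = np.log(np.asarray(preds).astype('float64')) / temperature
    exp_preds = np.exp(preds)
    return exp_preds / np.sum(exp_preds)


preds = [0.5, 0.3, 0.2]
for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, reweight(preds, temperature), sample(preds, temperature))
```

At low temperature the favourite hogs nearly all the probability; at high temperature the underdogs get more of a say.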

def on_epoch_end(epoch, logs):

# Function invoked at end of each epoch. Prints generated text.

print()

print('----- Generating text after Epoch: %d' % epoch)

The #comment explains that.

start_index = random.randint(0, len(text) - maxlen - 1)

A random integer from 0 to (number of characters in the entire length of the text – 40 – 1). This is the start_index because if we didn’t subtract 41, then some random indices would be so far at the end that they wouldn’t have enough room for the other 39 characters.

for diversity in [0.2, 0.5, 1.0, 1.2]:

print('----- diversity:', diversity)

These are the different values of the generated temperature hyper-parameter (we call it a hyper-parameter to distinguish it from the parameters learned by the model such as the weights and biases).

Low temperature = more deterministic, high temperature = more random.

generated = ''

An empty string '' is assigned to generated.

sentence = text[start_index: start_index + maxlen]

Each sentence has forty characters from the text.

generated += sentence

This adds the 40 characters to generated, which was the empty string '', and then assigns this combined value to generated.

print('----- Generating with seed: "' + sentence + '"')

sys.stdout.write(generated)

This will print the sentence being used at the moment in quotes after the statement ----- Generating with seed:

for i in range(400):

x_pred = np.zeros((1, maxlen, len(chars)))

This loop runs 400 times (i goes from 0 to 399), once per generated character. Each time, x_pred is made into an array of zeros with shape (1, maxlen, len(chars)): a single sample, with maxlen zeros along one dimension and len(chars) zeros along the other. This would be 40 zeros for maxlen and 57 zeros for len(chars) in our case. We want to represent the space of possibilities where the different characters can appear in our 40 slots.

for t, char in enumerate(sentence):

x_pred[0, t, char_indices[char]] = 1.

This sets a single zero to 1 for each character of the current sentence, without changing the surrounding zeros in the 40 by 57 grid. In effect it one-hot encodes the sentence, just as we did for the training data.

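preds = model.predict(x_pred, verbose=0)[0]

This is for predicting.

model.predict expects the first parameter to be a numpy array. Our numpy array is x_pred, the one-hot encoding of the current 40-character sentence. It returns one row of probabilities per sample; we passed in a single sample, so [0] picks out that one row, holding one probability for each possible next character.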

next_index = sample(preds, diversity)

Remember that we defined the function sample as (preds, temperature=1.0)

Now we are assigning this to the variable next_index.

next_char = indices_char[next_index]

We set our next character to be the next index from indices_char, where every character was assigned an index. Remember that we made a dictionary that converts from index to character, so we can get away with this.

generated += next_char

+= adds another value with the variable’s value and assigns the new value to the variable. So here we are adding the next character to generated, which holds the seed sentence plus everything generated so far.

sentence = sentence[1:] + next_char

So we make sure that the sentence goes from the second character to the end plus an added character. Notice that the first character in the sentence sits at index 0, so by starting from 1 we are cutting the first one off to make room for next_char. The sentence stays 40 characters long.
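The slide-over-by-one trick is easy to watch without any neural network. In this sketch the “predictions” are just the letters of ' wor' fed in by hand, which is my stand-in for whatever the model would pick, and the seed is 5 characters instead of 40.

```python
sentence = 'hello'        # the seed; always stays this long
generated = sentence      # keeps growing

for next_char in ' wor':  # pretend these came from the model
    generated += next_char
    sentence = sentence[1:] + next_char  # drop the oldest, append the newest
    print(repr(sentence), repr(generated))
```

The network only ever sees the last maxlen characters, while generated remembers the whole thing.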

sys.stdout.write(next_char)

sys.stdout.flush()

sys.stdout.write and sys.stdout.flush() are basically print. So this shows the next character, the one we are adding to the sentence.

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

This is so that every time an epoch ends, on_epoch_end runs and the generated text is printed.

model.fit(x, y, batch_size=128, epochs=60, callbacks=[print_callback])

This is what trains the model. The batch_size is 128, which is the number of training examples in one forward/backward pass. We train for 60 epochs, where one epoch is one forward pass and one backward pass over all the training examples.
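The arithmetic of batches and epochs is worth doing once by hand. The 1000 training examples below are a hypothetical round number, not the size of any real corpus; the 128 and the 60 are the values passed to model.fit.

```python
import math

n_examples = 1000  # hypothetical; use len(sentences) for your own text
batch_size = 128
epochs = 60

# The last batch of an epoch may be smaller than the rest, hence the ceil.
updates_per_epoch = math.ceil(n_examples / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, total_updates)
```

So the weights get nudged once per batch, many times per epoch, and the callback speaks up once per epoch.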