Opening The Door To Quantum Mechanics

One of the most common misconceptions about quantum mechanics is that an observation is simply one particle interacting with another particle. This false impression misses the true essence of what makes quantum mechanics philosophically intriguing.
Screen Shot 2018-09-25 at 3.36.46 PM
(Not what an observation is. And not what particles are.)
The truth is that there are no individual particles. But let’s talk as if there were for the sake of simplicity, in the same way that we talk about people even though, strictly speaking, no individual person exists.
Suppose we have a quantum randomizer which causes our particle to go in one of two directions.
Screen Shot 2018-09-25 at 3.42.03 PM
Now let’s add a second particle to our system. The first particle will interact with the second particle.
Screen Shot 2018-09-25 at 4.45.42 PM
The moment these two particles interact we say that they are entangled with one another. This is because if the first particle had gone in the other direction then the trajectory of the second particle would be completely different.
Observing the second particle alone is enough to tell us which of the two directions the first particle went in. The second particle therefore acts as a detector for the first particle.
But what if we choose not to observe either particle? According to quantum mechanics each particle will simultaneously be in a combination of both possibilities which we call superposition.
Now suppose we observe one of the two particles. The superposition seems to disappear, and we always see only one of the possibilities.
The two particles interacting with each other is not what counts as the observation.
After the two particles interact, both possibilities still exist; it is only after the observation that one of the two options becomes certain. Once the particles have interacted, observing either one of them is enough to know the state of both. We refer to this by saying that after the two particles interact, they are entangled with one another.
So the reason it becomes certain is either that a physicist’s consciousness has a magical power, or that there are now also two physicists, each one not knowing that he is also the other.
Screen Shot 2018-09-25 at 4.58.58 PM
This doesn’t just happen with paths. Something similar happens to the spins of two particles being entangled with one another. The spin of a particle in a particular direction can be observed to have only one of two possible values. These values are spin-up and spin-down.
Suppose we also have a second particle. There are now four different sets of possible observations. Just as our previous example could simultaneously be in a superposition of two different states when we were not observing it, this system can simultaneously be in a superposition of four different states when we are not observing it.
Screen Shot 2018-09-25 at 5.28.03 PM
Suppose we briefly observe only the particle on the right.
Screen Shot 2018-09-25 at 5.45.14 PM
Suppose we see that the particle on the right is spin-up. This means that two of the four possibilities disappear. The quantum system is now simultaneously in a superposition of only two possibilities.
Screen Shot 2018-09-25 at 5.47.02 PM
This quantum system does not contain any entanglement because measuring the spin of one of these two particles will not tell us anything about the spin of the other particle.
Let us use one of these particles as a detector to determine the spin of the other particle:
Screen Shot 2018-09-25 at 6.31.43 PM
As we bring the particles together, if the two particles are spinning in the same direction, then our experimental setup will cause the particle on the right to change its spin to the opposite direction.
But if the two particles start out spinning in opposite directions, then nothing will change. The particle on the right is known to be pointing up, whereas the spin of the particle on the left is unknown. The system consists of both of these possibilities existing simultaneously.
If we run our experiment without observing either particle, the system will continue to be in a superposition of two possibilities existing simultaneously. But regardless of which of the two states the system started in, after these particles have interacted with each other, they are guaranteed to be spinning in opposite directions. We therefore only need to observe one of the two particles to know the spins of both. As a result, after the two particles have interacted, we say that they are entangled with each other.
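The interaction just described can be sketched numerically. This is my own toy model, not code from the post: the two-spin state is a dictionary mapping basis labels to amplitudes, and the interaction flips the right spin whenever the two spins agree.

```python
from math import sqrt

def interact(state):
    """Flip the right spin whenever both spins point the same way."""
    new_state = {}
    for (left, right), amp in state.items():
        if left == right:
            right = 'down' if right == 'up' else 'up'
        new_state[(left, right)] = new_state.get((left, right), 0) + amp
    return new_state

# Right particle known to be spin-up; left particle in an equal superposition.
state = {('up', 'up'): 1 / sqrt(2), ('down', 'up'): 1 / sqrt(2)}
entangled = interact(state)
# After the interaction the spins are always opposite, so observing either
# particle tells us the state of both: that is the entanglement.
```

Running this leaves only the ('up', 'down') and ('down', 'up') possibilities, mirroring the guarantee described above.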
Suppose we allow these two particles to interact and become entangled but we do not observe either particle.  The system consists of both of these possibilities existing simultaneously. It’s only when we observe at least one of these particles that the outcome of the entire system becomes certain according to the mathematics of quantum mechanics. This remains true regardless of how many particles we have.
A detector simply consists of a large number of particles. This means that if we have two entangled particles, measuring the spin of one of them with a detector will not necessarily tell us the spins of the two particles. If we are not observing the detector or the particles, then the two particles will simply become entangled with all the particles inside the detector, in the same way that the two particles are entangled with each other. According to the mathematics of quantum mechanics, both sets of possible outcomes will exist simultaneously.
Suppose we observe the detector – which means that we observe at least one of the many particles that the detector is made of. Once we observe the detector, all the particles inside the detector and the two spinning particles that we originally wanted to measure will all simultaneously “collapse” into one of the two possibilities.
According to the mathematics of quantum mechanics, it does not matter how many particles the system is made of. We can connect the output signals of our detectors to large, complex objects, causing these large objects to behave differently depending on the detector’s measurement. According to the mathematics of quantum mechanics, if we do not observe the system, both possibilities will exist simultaneously, at least seemingly until we observe one of the many entangled particles that make up the system.
It is arbitrary to think that the universe only “collapses” at the whim of particular people or their instruments. To paraphrase Stephen Hawking, “It is trivially true that what the equations are describing is Many Worlds.” It is not just the separate magisterium of small things such as electrons, photons, buckyballs, and viruses that exists in Many Worlds. Humans and all other approximate objects also exist simultaneously, but we can obviously never experience this, for the reason Nagel’s bat points to: in order to experience something, you have to be it – like an adjective on the physical configuration. So you are also in each “alternate” reality, but it is impossible to feel this intuitively, because consciousness is not some soul that exists disembodied from the machinery. Your million clones are just as convinced that they were never you. I am also intuitively convinced that I was never you, but this is physically wrong.
Of course, we can define “I” as something different from that adjective-like Being, something different from the raw qualia, so to speak.
Screen Shot 2018-09-25 at 6.50.33 PM
We must be very clear that we are drawing lines around somewhat similar configurations, and not fashioning separate souls/consciousnesses.
Screen Shot 2018-09-25 at 6.56.46 PM
Okay, back to the QM. Here, once the particles become entangled, the two different possible quantum states are represented by the colors yellow and green.
Screen Shot 2018-09-25 at 7.08.15 PM
The yellow particles pass right through the green particles without any interaction. After the entanglement occurs, the system is represented by a wavefunction in a superposition of two different quantum states, represented here by yellow and green.
Screen Shot 2018-09-25 at 7.14.30 PM
One wave is not really above the other, but this visualization illustrates how the yellow quantum state is unable to interact with the green quantum state. Since the yellow wave can’t interact with the green wave, no interference pattern is created when the detectors are present.
Screen Shot 2018-09-25 at 7.19.36 PM
On the other hand, with the detectors removed, the entanglement with the detectors never happens and the system does not split into the yellow and green as before. The resulting waves are therefore able to interact and interfere with each other. Two waves interacting with each other creates a striped pattern. This is why a striped probability pattern is created when particles pass through two holes without any detectors present, and it’s why a striped probability pattern is not created when particles pass through two holes with detectors present.
Screen Shot 2018-09-25 at 7.27.52 PM
Having just one detector present has the same effect as having two detectors. This is because only interaction with a single particle is required in order for entanglement to occur. But even after a particle interacts with a detector consisting of many different particles, the system is still in both states simultaneously until we observe one of the detectors.
There’s considerable debate as to what is really happening and there are many different philosophical interpretations of the mathematics. In order to fully appreciate the essence of this philosophical debate it’s helpful to have some understanding of the mathematics of why entanglement prevents the wavefunctions from interacting with each other.
The probability of a particle being observed in a particular location is given by the square of the amplitude of the wavefunction at that location.
Screen Shot 2018-09-25 at 7.44.05 PM
In this situation, the wavefunction at each location is the sum of the wavefunctions from each of the two holes.
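As a rough numerical sketch of this (my own illustration, with made-up amplitude values): the amplitude at a location is the sum of the two holes’ contributions, and the probability is the squared magnitude of that sum.

```python
def probability(psi_hole1, psi_hole2):
    # Amplitudes add first, then we square the magnitude of the sum.
    return abs(psi_hole1 + psi_hole2) ** 2

# Equal magnitudes, same phase: constructive interference (a bright stripe).
bright = probability(0.5, 0.5)
# Equal magnitudes, opposite phase: complete cancellation (a dark stripe).
dark = probability(0.5, -0.5)
```

The cancellation in the second case is exactly what produces the dark bands of the striped pattern.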
Although there are many different places where the particle can be observed, to simplify the analysis, let’s consider a scenario where the particle can be in only one of two places. This scenario is similar to measuring the spin of a single particle, in that there are only two possible outcomes that can be observed.
Screen Shot 2018-09-25 at 5.47.02 PM
The state of spin-up can be represented by a 1 followed by a 0.
Screen Shot 2018-09-26 at 7.36.57 AM
The state of spin-down can be represented by a 0 followed by a 1.
Screen Shot 2018-09-26 at 7.37.20 AM
Similarly, we can use the same mathematical representation for measuring the location of our particle. We will signify observing the particle in the top location with a 1 followed by a 0 and we will signify observing the particle in the bottom location with a 0 followed by a 1.
Screen Shot 2018-09-26 at 8.00.53 AM
Let’s now add a detector indicating which of the two holes the particle passed through. We are going to observe both the final location of the particle and the status of the detector.
Screen Shot 2018-09-25 at 4.45.42 PM
There are now a total of four different possible sets of observations. This is similar to how we had four different possible sets of observations when we had two spinning particles. Although our detector is a large object, let us suppose that this detector consists of just a single particle. In the case of the two spinning particles, each of the four possible observations can be represented with a series of numbers as shown.
Screen Shot 2018-09-25 at 5.28.03 PM
The same mathematical representation can be used in the case of observing the position of our particle and the status of our detector. Here we need four numbers because there are four possible outcomes when the status of the detector is included. But if we didn’t have the detector, we would only need two numbers, because there are only two possible outcomes, just as we needed only two numbers for a single spinning particle.


The principle of quantum superposition states that if a physical system may be in one of many configurations—arrangements of particles or fields—then the most general state is a combination of all of these possibilities, where the amount in each configuration is specified by a complex number.

For example, if there are two configurations labelled by 0 and 1, the most general state would be

c₀ |0> + c₁ |1>

where the coefficients are complex numbers describing how much goes into each configuration.


The c are coefficients. The probability of observing the spin of the particle in each of the two states is given by the squares of the magnitudes of these coefficients. If we have two spinning particles we can have four possible observations, each of which is represented with a sequence of four numbers.

If the system is in a superposition of all four states simultaneously, then this is represented by the same mathematical expression. As before, the c are constants. As before, the probability of observing the spins of the particles in each of the four states is given by the squares of the magnitudes of each of these constants.
This same mathematical representation can be used to describe observing the location of the particle and the state of the detector. Here, the c coefficients represent the values of each of these wavefunctions at the final location of the particle when the system is in a superposition of these four possibilities:
Screen Shot 2018-09-26 at 10.28.14 AM
But if we never had the detector then each quantum state would be represented by only two numbers instead of four since there are only two possible observations. As before, the c coefficients represent the values of the wavefunction from each of the two holes at the final locations of the particle without the detector. If the system is in a superposition of both quantum states simultaneously, it’s represented mathematically as follows:
c₀ |0> + c₁ |1>
Here, if one of the c coefficients is positive and another is negative, they can cancel each other out. With a detector present, on the other hand, the c coefficients can never cancel: even if one coefficient is positive and the other is negative, the two outcomes are distinguishable, so their probabilities simply add when calculating the probability of observing the particle at a certain position. But without a detector, if one of the c coefficients is positive, the other is negative, and their magnitudes are equal, then they cancel each other out completely and give a probability of zero.
If the particle is not limited to being at just two possible positions, then there will be certain locations where the c coefficients representing the values of the two wavefunctions will cancel each other completely. This is what allows a striped probability pattern to form when there is no detector present, and it’s also why a striped probability pattern does not form if there is a detector present.
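The arithmetic of the two cases can be sketched in a few lines (illustrative coefficient values of my own choosing, not from the post):

```python
c0, c1 = 0.5, -0.5

# Without a detector the amplitudes add before squaring, so opposite
# signs can cancel: a dark stripe.
p_no_detector = abs(c0 + c1) ** 2

# With a detector the two outcomes are distinguishable, so the
# probabilities add after squaring and can never cancel.
p_with_detector = abs(c0) ** 2 + abs(c1) ** 2
```

With these values, the no-detector probability is exactly zero while the with-detector probability is 0.5, which is the whole difference between the striped and unstriped patterns.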
Note that nowhere in this mathematical analysis was there ever any mention of a conscious observer. Whether or not the striped pattern appears therefore has nothing to do with whether a conscious observer is watching; it depends only on the presence or absence of a detector, and a detector of just a single particle is enough. A conscious observer choosing whether or not to watch the experiment will not change this outcome. But because the mathematics says nothing about the influence of a conscious observer, it also says nothing about when the system changes from being in a superposition of multiple possible outcomes simultaneously to being in just one of them. When we observe the system, we always see only one of the possible outcomes, yet if conscious observers don’t play any role, it’s not clear what exactly counts as an observation, since particles interacting with each other do not qualify.
There’s considerable philosophical debate on the question of what counts as an observation, and on the question of when, how, and if the system collapses to just a single possible outcome. However, it seems that most of the confusion stems from being unable to think like an open individualist – being unable to adhere to a strictly reductionist, physicalist understanding.
Some philosophers want there to be a “hard problem of consciousness” in which there are definite boundaries for souls with particular continuities. But if we just accept the mathematical and experimental revelation, we see that this ontological separation is an illusion. Instead, what we try to capture when we say “consciousness” can only be a part of the one Being containing all its observations. It is in this sense that consciousness is an illusion. We do not really say that qualia is unreal, but rather that it cannot be mapped to anything more than a causal shape that lacks introspective access to its own causes. A self-modeling causal shape painting red cannot be a self-modeling causal shape painting blue. But ultimately, the paintings occur on the same canvas.
Of course, there is a way to formulate the hard problem of consciousness so that it points to something. That which it points to is the hard problem of existence. Why is there something as opposed to nothing? This question will never have an answer. With David Deutsch, I take the view that the quest for knowledge doesn’t have an end because that would contradict the nature of existence. The quest for knowledge can be viewed as exploration of the experiential territory. If you had a final answer, a final experience, then this would entail non-experience (non-experience cannot ask Why is there something as opposed to nothing?).
Fantasizing about a final Theory of Everything is thinly veiled Thanatos Drive – an attempt at self-destruction which eternally fails; not least because of quantum immortality.

Kawaii LSTMs


I created the anime girl faces with Yanghua Jin et al’s GAN.
Take the link. But it’s… it’s not like I like you or anything. Baka!


Screen Shot 2018-03-13 at 10.41.43 AM
First, you must learn slopes

Slopes are changes in y over changes in x. In calculus, we discover that the slope at a point on a function is the slope of the tangent line at that point.

300px-Tangent_to_a_curve.svg
If you know the inclination of the slope, you know if you are walking up a hill or down a hill, even if the terrain is covered in fog.


Screen Shot 2018-03-13 at 11.08.27 AM
Math is useful in life. *__*

The higher the value of the function, the more error it represents. The lower, the less error.

We wish to know the slope so that we can reduce error. What causes the error function to slither up and down in its error are the parameters. 

If we can’t feel the slope, we don’t know if we should step to the “right” or to the “left” to reduce the error.

Screen Shot 2018-03-13 at 11.07.40 AM
But we ought to make the slope flat, right?

Not necessarily. The problem is that we can end up on the tip-top of a hill and also have a flat slope. We want to end up at a minimum. This means that we must follow a procedure: If negative slope, then move right. And if positive slope, move left. Never climb, always slide.
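The slide-downhill procedure can be written out in a few lines. This is a sketch of my own, assuming a made-up error function (x − 3)² whose slope is 2(x − 3):

```python
def slope(x):
    # Derivative of the error function (x - 3)^2.
    return 2 * (x - 3)

x = 10.0                 # dropped somewhere in the fog
for _ in range(1000):
    x -= 0.1 * slope(x)  # negative slope -> move right; positive -> move left
# x slides down to the minimum near 3, never climbing.
```

The update rule never climbs: a positive slope pushes x left, a negative slope pushes x right, exactly the procedure above.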

There is a similar procedural mission going on in a neural network except that the sense of error comes from a higher-dimensional slope called a gradient.

Screen Shot 2018-03-13 at 11.08.27 AM
You mean that what you said was wrong?

No, it’s very similar. The gradient tells you the direction of steepest ascent in a multidimensional terrain. Then, you must step toward the negative gradient.

Oh, I didn’t mention that the terrain was multidimensional? Well it is. There is not a single place where the input goes like in the function I initially showed you.

This means that not only is there fog but that the hills and valleys are beyond human comprehension. We can’t visualize them even if we tried. But like the sense-of-error from slope which guides us down a human-world hill, the gradient guides us down to the bottom in multi-dimensional space.
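A minimal sketch of the same idea in more than one dimension, again with an error landscape of my own choosing (a bowl centered at all-ones):

```python
def gradient(w):
    # Gradient of the bowl-shaped error sum((w_i - 1)^2).
    return [2 * (wi - 1) for wi in w]

w = [5.0, -4.0, 9.0]     # a random drop point in the terrain
for _ in range(1000):
    g = gradient(w)
    # Step toward the negative gradient in every dimension at once.
    w = [wi - 0.1 * gi for wi, gi in zip(w, g)]
```

Each coordinate slides downhill independently, and the whole vector w ends up at the bottom of the bowl even though we could never visualize the terrain.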

The neural network is composed of layers. Each layer has a landscape to it, and hence its own set of parameters w with its own gradient.

Here is the objective function Q(w) for a single layer:
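The equation image is missing here; in the standard formulation (my assumption, not the original figure), the objective is an average of per-example losses:

```latex
Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w)
```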


The goal is to plug in a lucky set of parameters w1 on the neurons of the first layer, the lucky set of parameters w2 on the neurons of the second layer, and so on with the intention of minimizing the function.

We don’t just guess randomly each time, we slide towards the better w based on our sense of error from the gradient. The gradient is revealed at the final layer’s output.

However, we are initially dropped randomly in the function. Our first layer’s w has to be random.

This presents a huge problem. Although all we have to do is calculate the gradient of the error with respect to the parameters w, the weights closer to the end of the network tend to change a lot more than those at the beginning. If the initial w randomly falls into a region where the weight updates are lethargic, the whole network will barely move.


By the way, weights are a subset of the parameters. Think of each weight/parameter update as an almost magical multidimensional-step in the stroll through the landscape; with every single step determined by the gradient.

The first guide in our multidimensional landscape may happen to have a broken leg, so he cannot explore his environment very well. Yet guide number two and guide number three must receive directions from him. This means that they will also be slower at finding the bottom of the valley.

LSTMs solve this by knowing how to remember. So now let’s look inside an LSTM.

Screen Shot 2018-03-13 at 11.07.40 AM
Now, at last you get to the point?!




Recurrent neural networks are intimately related to sequences and lists. Some RNNs are composed of LSTM units.


Stay tuned for the explanation of what is going on in there!

How to Create a Custom Sacred Text with Artificial Intelligence

Okay, let’s create a new religion using the power of neural networks. That’s my definition of a night well spent.

I will feed it Neon Genesis Evangelion, some of the Buddhist Suttas, Wikipedia articles about cosmology, and text from, and see what kind of deep-sounding fuckery it comes up with.

To do it yourself, first install Python, Keras, and a backend (Theano or TensorFlow). Make sure you install the backend first, then Keras. Make sure the version of Python that launches in the terminal when you type python is the same one that Keras is installed under.

To find out where Keras is installed, run pip install keras again; the output should mention a version of Python. You don’t want Python in the terminal to be 2.6 while Keras is on Python 3.6. If this is the case, type python3 instead of python.

If you are pasting each line into the terminal, watch for the ‘>>>‘ and ‘…‘ prompts. If there is an indentation in the script, you should tab after ‘…’. If there is no longer an indentation, press enter to exit ‘…’ so that ‘>>>‘ shows up again.

The better, less tedious way to run it is to save the script as a .py file using the Python Shell. Once you save it, paste this on top of the code: #!/usr/bin/env python

Go to the terminal and enter chmod +x followed by the entire path to your file, such as /Users/mariomontano/Documents/ You should find the path on top of the window when you create and save a new file in the Python Shell.

Then type python3 /Users/mariomontano/Documents/ into terminal to run it.

I’m going to explain the code to reduce the unease.

from __future__ import print_function

from __future__ import print_function is the first line of code in the script. This commits us to using print as a function from now on. A function is a block of code that is used to perform a single action.

The whole point of from __future__ import print_function is to bring the print function from Python 3 into Python 2.6+, just in case you’re not using Python 3. If you are using Python 3, don’t worry about it.

from keras.callbacks import LambdaCallback

So there is a training procedure we have to set off, but we’re going to want to view the internal states and statistics of the model during training.

LambdaCallback lets us create a custom callback that reports at a certain time. Training is separated into distinct phases by arbitrary cutoffs, which is useful for logging and periodic evaluation; each such phase is called an epoch. So at the end of each epoch, the callback will report whatever we set it up to report.

from keras.models import Sequential

This time, we are choosing the kind of neural network – the model. There are two kinds of models in Keras: Sequential and Functional API.  Basically, you use the Sequential Model if you want to keep things simple, and you use the Functional API to custom design more complex models, which include non-sequential connections and multiple inputs/outputs. We want to keep things simple.

from keras.layers import Dense, Activation

Here, we are bringing two important things to the table: Dense() will allow us to summon layers with a chosen number of neurons, and Activation() is for choosing the function that is applied to a layer of neurons. By tweaking the kind of activation function and number of neurons, you can make the model better or worse at what it does.


from keras.layers import LSTM

An LSTM is a type of recurrent neural network that allows information to be remembered. We don’t want it to forget everything in each training round.

from keras.optimizers import RMSprop

An optimizer is one of the two arguments required for compiling a Keras model. RMSprop is an optimizer which is usually a good choice for recurrent neural networks.

from keras.utils.data_utils import get_file

This will allow us to download a file from a URL if it is not already in the cache.

import numpy as np

This allows us to refer to numpy as np, writing, for example, np.array([1,2,3]) instead of numpy.array([1,2,3]).

import random

random will allow us to generate random integers. This will be important down the line. Remember that first we are equipping ourselves.

import sys

sys is a module which provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. One such function is sys.stdout.write, which is more universal if you ever need to write dual-version code (e.g. code that works simultaneously with Python 2.x as well as Python 3.x). Combined with sys.stdout.flush, it will later allow us to see output even before the script completes. If we didn’t use these, then we would see the output printed all at once to the screen, at the end. This is because the output is being buffered, and unless we flush sys.stdout with each print, we won’t see the output immediately.

import io

This will allow us to access web data – to open the cereal box for our hungry machine when that delicious cereal is in a web page. As in, encoding='utf-8'). On the left side of the comma goes the path, and on the right goes the character encoding. The path points to where your text data is held, and the character encoding will be utf-8.

path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')

Just replace

path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')

with a path to your own text or combination of texts.

If you don’t want it hosted on a site and are on a Mac, you can store a file as .txt then find it by:

right clicking on file in finder -> Get Info -> copy the stuff in front of Where: + the file name with .txt at the end -> path = “____ ” 


with, encoding='utf-8') as f:

    text =

The upper line of code opens the path we defined above as ‘nietzsche.txt’, decoded as ‘utf-8’. (If you don’t pass in any encoding, a system-specific default will be picked. The default encoding cannot actually express all characters; this will bite you on Python 2.x and/or Windows.)

We do as f: so that we can then write simply instead of the much longer, encoding='utf-8').read().lower(). When we do this, f is called a file object.

We read().lower() so that the string comes out in lower case.

print('corpus length:', len(text))

This will output the statement ‘corpus length:’ and the number of characters in the entire string of text. Remember that a string is a linear sequence of characters.

chars = sorted(list(set(text)))

set(text) makes sure that each character only exists once, list() turns the result into a list, and sorted() puts the characters into a consistent order.

For example: ‘The dog went to the pound after eating a pound of dog.’ would become [the, dog, went, to, pound, after, eating, a, of] if each character was a word. But in our case, each character is a letter/number/special.

So just think of that example but with individual letters. Out of a large corpus, you would probably get out the entire alphabet, numbers, and special characters.

print('total chars:', len(chars))

This will give total chars: 57, for example. It gives you the number of characters after eliminating all repeated characters, unlike print('corpus length:', len(text)), which gives you the total number of characters.

char_indices = dict((c, i) for i, c in enumerate(chars))

enumerate(chars) will assign a number to each character. The numbers start at 0 and climb up, 1,2,3,4… for each character in the text.

dict() then pairs each character (the arbitrary object, or key) with its assigned number (its index, or value).

indices_char = dict((i, c) for i, c in enumerate(chars))

This may seem a bit redundant, but this reverse mapping ensures that a particular variable (in this case indices_char) stores the characters mapped to their numerical indices. This is so that we can convert the integers back to characters once we start getting integer predictions later on.

In other words, what we did with these two lines of code is create a dictionary that maps each character to a number and vice versa.

i is often referred to as the id of the char.
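A tiny concrete run of these two dictionary lines, on a toy string of my own:

```python
chars = sorted(list(set('abba')))  # ['a', 'b']
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
# char_indices maps characters to ids; indices_char maps ids back.
```

For this toy text, char_indices comes out as {'a': 0, 'b': 1} and indices_char as {0: 'a', 1: 'b'}.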


# cut the text in semi-redundant sequences of maxlen characters

When reading code, a hashtag before a set of words means these words are not part of the code. It is a statement by the author(s) about what a section of code is meant to do. Like a hyper-rushed explanation. Sadly, even then, most code in the world is uncommented… But here I am. It’s okay. Mankind may abide in me from now on.

What is meant by cut the text in “semi-redundant sequences” is best explained by looking at what the code itself does.

maxlen = 40

This sets the character count in each chunk to 40.

step = 3

By setting the step equal to 3, we divide the entire dataset into chunks of length 40, where the beginning of each chunk is 3 steps/characters apart.

sentences = []

next_chars = []

These use brackets instead of () because [] are designed to be used for lists and for indexing/lookup/slicing. Plus the inner contents of [] can be changed. This is exactly what we need. In the next lines of code we will fill these two containers.

for i in range(0, len(text) - maxlen, step):

for means the code will be executed repeatedly.

i is the loop variable; here it stands for the starting position of each chunk in the text

range() returns a list of numbers

range(Starting number of the sequence, Generate numbers up to but not including this number, Difference between each number in the sequence)

so if we have range(0,50,3), it will return [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48]

sentences.append(text[i: i + maxlen])

.append() does this:

Screen Shot 2018-03-09 at 3.10.19 PM

with a being sentences in our case. So this is the range from current index i consisting of  40 characters. i: i + maxlen means “from i to i + maxlen“. We are filling the sentences with 40 characters.

Taking the last two lines of code I explained together, we are filling the sentences with 40 characters every 3 characters.

next_chars.append(text[i + maxlen])

Now for next_chars, we fill it with only the single character that comes right after each chunk. Notice it says text[i + maxlen], not text[i: i + maxlen]: we are appending a single character, not a slice.

next_chars is the single next character following after the collection of 40.

So next_chars will be filled with the single character that follows each 40-character chunk, for every chunk taken 3 characters apart within the specified range.

print('nb sequences:', len(sentences))

This will output the number of sentences created by 3-stepping, which should be roughly one third of the corpus length.
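Here is the chunking loop run end to end on a made-up short string, with a toy maxlen of 5 (the real value of 40 would need a much longer text):

```python
text = "hello world, hello again"  # toy corpus, 24 characters
maxlen = 5                         # toy value; the post uses 40
step = 3

sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])  # a 5-character window
    next_chars.append(text[i + maxlen])    # the single character after it

print('nb sequences:', len(sentences))     # 7 windows, starting 3 characters apart
print(sentences[0], '->', repr(next_chars[0]))
```

Each window overlaps the previous one by all but 3 characters, which is why the sequences are "semi-redundant."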


x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)

y = np.zeros((len(sentences), len(chars)), dtype=np.bool)

np.zeros(shape) creates an array of zeros with the given shape. If the shape is a single number like 6, np.zeros creates [0., 0., 0., 0., 0., 0.]. A second number in the shape tuple brackets each of those zeros and specifies how many zeros go inside each bracket: with np.zeros((6, 2)) we get [[0., 0.], [0., 0.], [0., 0.], [0., 0.], [0., 0.], [0., 0.]]. Here the shape is (len(sentences), maxlen, len(chars)). Play around with np.zeros() to get an intuition for it.

dtype=np.bool ensures that each entry is one of two options, True or False (1 or 0). (In newer NumPy versions np.bool is removed; plain bool does the same thing.)

What we are doing here is storing our data into vectors.

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Remember, the indices i and t stand for which sentence and which character position within it, respectively. For each character we place a single 1 at the index corresponding to that character, leaving every other entry 0: 'a' might become [1,0,0,0,…,0], 'b' would be [0,1,0,0,…,0], and so on.

We now have, for each sentence, a 2-dimensional matrix (40 positions by 57 possible characters), stacked into the 3-dimensional array x; and for each target character, a one-hot row in the 2-dimensional array y.

This is called one-hot encoding.
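As a sketch, here is the same vectorization loop run on a made-up three-character alphabet (all values are toy, but the loop is the one above):

```python
import numpy as np

# Toy corpus: two "sentences" of maxlen 2 over a 3-character alphabet
sentences = ['ab', 'ba']
next_chars = ['c', 'c']
chars = ['a', 'b', 'c']
char_indices = {c: i for i, c in enumerate(chars)}
maxlen = 2

x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

print(x[0].astype(int))  # [[1 0 0]   -> 'a'
                         #  [0 1 0]]  -> 'b'
print(y[0].astype(int))  # [0 0 1]    -> next char is 'c'
```

Each character becomes a row with a single 1 at its own index, which is exactly what "one-hot" means.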

print('Build model...')

model = Sequential()

A Sequential model is a linear stack of layers.


You use this simple model in several situations. For example, when you are performing regression, you will usually have a final layer as linear.

You also use it when you want to generate a custom Bible based on anime dialogue, Nick Bostrom’s philosophy, and your own Tennysonian solarpunk fiction.

model = Sequential() starts the model, which you can design with custom layers, as you will see in the following lines of code.

model.add(LSTM(128, input_shape=(maxlen, len(chars))))

When you do model = Sequential(), you can then choose model.add(Dense()) or model.add(LSTM()). These two are among the layer types we imported from keras.layers way back at the beginning. They are layers: those columns in the picture.

Dense() is considered the regular kind of layer. A linear operation in which every input is connected to every output by a weight. To understand what it actually means, you must go here.

We are using an LSTM layer, so we must specify two things: 1. the number of units (neurons) in this hidden layer, 128 in our case; 2. the input_shape, which is specified by maxlen and len(chars) in our case. By saying input_shape=(maxlen, len(chars)) we are essentially telling it "Hey, we will be feeding you sequences of 40 characters drawn from 57 kinds (the alphabet plus punctuation, etc.)"

The output dimensionality of the LSTM layer is 128. Unlike input_shape, this number was not determined by our data. It is a conventional choice that has been found experimentally to be useful across many neural-network use cases. You can change it and possibly receive better results, but be warned that a very big layer may not fit in memory and takes longer to train. (It happens that the batch_size we use later is also 128, but the two numbers are independent.)

To clarify and summarize: the batch_size denotes the subset size of our training samples (e.g. 100 out of 1000) used for one update of the network during its learning process. Each batch trains the network in a successive order, taking into account the updated weights coming from the previous batch. Here that number happens to equal the number of units in our hidden layer, but that is a coincidence, not a requirement.


model.add(Dense(len(chars)))

This is a linear layer composed of as many neurons as there are distinct characters in the text. For example: 57.


model.add(Activation('softmax'))

This is our final layer.

Remember that our goal is to minimize the objective function, which is parametrized by the model's weights and biases. We update those parameters by nudging them in the opposite direction of the gradient of the objective function. This way, we take little steps downhill. The goal is to reach the bottom of a valley.

Screen Shot 2018-03-16 at 10.01.21 AM

The image shows a function with two inputs. Our function’s landscape cannot be visualized by humans because it has way more than two inputs.

In order to minimize the cost function with gradient descent, it is important that the network's output be a smooth, differentiable, non-linear function of its parameters.

A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks. We want our neural networks to work on complicated tasks like language translation and image classification. Linear transformations alone would never be able to perform such tasks.

Activation functions make back-propagation possible, since the gradients are supplied along with the error to update the weights and biases. Without a differentiable non-linear function, this would not be possible.

Activation('softmax') squashes the output of each neuron into the range 0 to 1, and makes all the outputs sum to 1, so they can be read as probabilities:




This is important for our eventual goal of allowing the network to move to a local minimum by little nudges in the direction of the negative gradient. There are many activation functions, but we are using softmax because it takes the N-dimensional vector of raw outputs (one per character) and turns it into a probability distribution over those characters.
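A minimal NumPy sketch of what softmax computes (not the Keras layer itself, just the math): each value is exponentiated, then divided by the sum of all the exponentials.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))  # shift by the max for numerical stability
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # every entry lands between 0 and 1
print(probs.sum())  # and they always sum to 1
```

Note how the largest input gets the largest probability; the exponentiation preserves the ordering while forcing everything into a valid distribution.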

optimizer = RMSprop(lr=0.01)

An optimizer is one of the two arguments required for compiling a Keras model.

This optimizer divides the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.

This helps because we don’t want the learning rate to be too big, causing it to slosh to and fro across the minimum we seek.

model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Once you have defined your model, it needs to be compiled. This creates the efficient structures used by the underlying backend (Theano or TensorFlow) in order to efficiently execute your model during training.

loss= specifies the loss function, also called the objective function: the evaluation of the model used by the optimizer to navigate the weight space.

Since we are using categorical labels, i.e. one-hot vectors, we choose categorical_crossentropy from the loss function options. If we have two classes, they will be represented as 0, 1 in binary labels and 10, 01 in categorical label format. Our target for each sample character is a len(chars)-dimensional vector (57 entries in our case) that is all zeros except for a 1 at the index corresponding to the class of the sample.

def sample(preds, temperature=1.0):

def sample takes the probability outputs of the softmax function and samples the index of one character from them: more probable characters are more likely to be chosen, but the choice is random rather than always the single most probable one.

The temperature parameter decides how strongly the differences between the probability weights count. A temperature of 1 takes each weight "as it is," a temperature larger than 1 reduces the differences between the weights, and a temperature smaller than 1 amplifies them.

The way it works is by scaling the logits before applying softmax.

# helper function to sample an index from a probability array

This comment summarizes what we just described: sample picks a character index at random, weighted by the (temperature-adjusted) probabilities.

preds = np.asarray(preds).astype('float64')

np.asarray is the same as np.array except it has fewer options, and copy=False

.astype('float64') – we cast to float64 precision, which can represent about 15–16 significant decimal digits (float32 gives about 7).
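You can check those precision figures yourself with NumPy's finfo:

```python
import numpy as np

# precision = approximate number of decimal digits the type is reliable to
print(np.finfo(np.float32).precision)  # 6
print(np.finfo(np.float64).precision)  # 15
```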

preds = np.log(preds) / temperature

np.log(preds) takes the array into the natural log function

/temperature – the default temperature is 1.0, in which case dividing changes nothing; but the generation loop below calls sample with several different diversity values (0.2, 0.5, 1.0, 1.2), and for those the division does the real reweighting.

exp_preds = np.exp(preds)

This is part of the common function to sample from a probability vector. It calculates the exponential of all elements in the input array.

preds = exp_preds / np.sum(exp_preds)

np.sum takes the sum of array elements over a given axis. Since we have not specified an axis, the default axis=None, and we will sum all of the elements of the input array.

probas = np.random.multinomial(1, preds, 1)

np.random.multinomial samples from a multinomial distribution. A multinomial is like a binomial distribution but with many variables.


With (1, _, _) we specify that only one experiment is taking place. An experiment can have several possible outcomes, each with its own probability. For example, dice will always yield a number from 1 to 6. We are ensuring that it knows we are only "playing dice," and not also coin-flipping – because in that domain the outcome probabilities are different.

(_,preds,_) This middle term actually expresses the probability of the possible outcomes, p.

The (_, _, 1) ensures that only 1 array is returned.

The returned array holds counts of how many times our metaphorical die landed on each of the faces "1, 2, 3, 4, 5, and 6." With a single experiment, that is a single 1 in the winning slot and 0s everywhere else.
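The dice metaphor is easy to check directly (the per-face counts are random; only their totals are fixed):

```python
import numpy as np

p = [1/6] * 6                          # a fair six-sided die
rolls = np.random.multinomial(60, p)   # one experiment of 60 rolls
print(rolls, rolls.sum())              # six counts that always sum to 60

one = np.random.multinomial(1, p, 1)   # the shape used in the post: 1 roll, 1 array
print(one)                             # a single 1 marking the face that came up
```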

return np.argmax(probas)

return np.argmax returns the indices of the maximum values along an axis. We do not specify an axis here. So by default, the index is from the flattened array of probas.
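Putting the pieces together, the whole helper can be run stand-alone with plain NumPy. The probability vector below is a made-up example, chosen only to show the effect of temperature:

```python
import numpy as np

def sample(preds, temperature=1.0):
    """Sample a character index from softmax probabilities, reweighted by temperature."""
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature    # scale the (log-)weights
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)  # renormalize into probabilities
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)               # index of the single 1

np.random.seed(0)                          # fixed seed so the demo is repeatable
probs = [0.7, 0.2, 0.1]                    # toy softmax output over 3 "characters"
cold = [sample(probs, temperature=0.2) for _ in range(100)]
hot = [sample(probs, temperature=1.2) for _ in range(100)]
print('index 0 chosen:', cold.count(0), 'of 100 (cold) vs', hot.count(0), 'of 100 (hot)')
```

With a low temperature the already-likely index 0 is chosen almost every time; with a high temperature the two rarer indices show up much more often.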

def on_epoch_end(epoch, logs):

# Function invoked at end of each epoch. Prints generated text.


print('----- Generating text after Epoch: %d' % epoch)

The #comment explains that.

start_index = random.randint(0, len(text) - maxlen - 1)

A random integer from 0 to (number of characters in the entire text − 40 − 1). This is the start_index; if we didn't subtract 41, some random indices would land so close to the end of the text that there wouldn't be enough room for the rest of the 40-character seed.

for diversity in [0.2, 0.5, 1.0, 1.2]:

print('----- diversity:', diversity)

These are the different values of the temperature hyper-parameter used for generation (we call it a hyper-parameter to distinguish it from the parameters learned by the model, such as the weights and biases).

Low temperature = more deterministic, high temperature = more random.

generated = ''

An empty string is assigned to generated.

sentence = text[start_index: start_index + maxlen]

Each sentence has forty characters from the text.

generated += sentence

This appends the 40-character seed to generated, which starts out as the empty string, and assigns the combined value back to generated.

print('----- Generating with seed: "' + sentence + '"')

This will print the sentence currently being used, in quotes, after the statement ----- Generating with seed:

for i in range(400):

     x_pred = np.zeros((1, maxlen, len(chars)))

For each of the 400 iterations (i runs from 0 to 399), make x_pred a fresh array of zeros: one slot along the first dimension (a batch of one), maxlen zeros along the second, and len(chars) along the third. That is 1 × 40 × 57 in our case. We want to represent the space of possibilities where the different characters can appear in our 40 slots.

for t, char in enumerate(sentence):

x_pred[0, t, char_indices[char]] = 1.

This iterates over the current 40-character sentence and assigns a 1 for each character, without changing the surrounding zeros in the 40-by-57 grid: the same one-hot encoding as before, applied to the seed we are about to predict from.

preds = model.predict(x_pred, verbose=0)[0]

This is for predicting.

model.predict expects the first parameter to be a numpy array. Our numpy array is x_pred, which is the space of all possible locations for each character.

next_index = sample(preds, diversity)

Remember that we defined the function sample as (preds, temperature=1.0)

Now we are assigning this to the variable next_index.

next_char = indices_char[next_index]

We set our next character to be the next index from indices_char, where every character was assigned an index. Remember that we made a dictionary that converts from index to character, so we can get away with this.

generated += next_char

+= adds another value to the variable's value and assigns the new value back to the variable. So here we are appending the next character to generated, the text produced so far.

sentence = sentence[1:] + next_char

So the new sentence runs from the second character of the old one to the end, plus the newly generated character. Since Python indexing starts at 0, sentence[1:] cuts off the first character to make room for next_char: a sliding window.
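The sliding window is easy to see on a toy string (a 5-character seed instead of the real 40):

```python
sentence = "hello"                   # toy 5-character window
next_char = "!"
sentence = sentence[1:] + next_char  # drop the oldest char, append the newest
print(sentence)                      # 'ello!'
```

The window length never changes; each step the oldest character falls off the front as the newest one joins the back.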



sys.stdout.write(next_char)

sys.stdout.flush()

sys.stdout.write and sys.stdout.flush() are basically print: they show the next character, the one we just added to the sentence, without waiting for a newline.

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

This is so that every time an epoch ends, the generated text is printed.

model.fit(x, y, batch_size=128, epochs=60, callbacks=[print_callback])

This is what trains the model. The batch_size is 128, which means 128 training examples are used per forward/backward pass. epochs=60 means the model goes through the whole training set 60 times, one epoch being one forward pass and one backward pass over all the training examples.

