A question presents itself to the psychoanalytic mind: Who even thinks these things?
When the valence of mind deteriorates past a certain point, people start looking for exits. One of these exits is to claim that everything is a dream, everything is empty, or diluted of substance in some way. This can be a way to dissociate from the social context that breathes the fire of our self and suffering. The sequence similarity between humans/Homo sapiens and green monkeys/Chlorocebus sabaeus is 94%, so it is to be expected that we would want to cut ties with our social group after trauma; our method for clipping the cord to the social eyes is only slightly more sophisticated than that of other social mammals who diverged from a relatively recent common ancestor.
[Ahh… Yes. This is why Elon Musk dotes on the simulation; longs for the holographic principle to delete his curse.]
What is stated around here about the nature of reality should not be confused with that genre, even if it sounds weird and you are therefore tempted to complete the pattern: escapist. I am committed to life. And by life, I mean life in the conventional sense, from the indexical present, which contains human persons dying from trauma and neurofibrillary tangles.
But talk is cheap. Let me take a short detour here to contribute to anti-aging research and prove that I believe in us:
So I contacted the SENS Research Foundation, which is lonely at the frontline in the battle to save humanity (aging is the number one cause of death and disease, remember). They gave me a link to a dataset which contains genes associated with aging. And I’m going to use my machine learning skills to see what I can do with it.
Here is the dataset.
Clean The Dataset
A human may understand that 5p13.1 represents a cytogenetic location. Let me correct that: A smart human might understand that 5p13.1 represents a cytogenetic location. But a neural network certainly can’t take the statement 5p13.1 without modification.
All must be transmuted to digits before it is presented to the neural network. It is not that a neural network is incapable of dealing with human-understandable categories, since such a limitation would surely defeat the point of using such a tool. It is merely the case that we need to repackage the categories in a representation that it can understand.
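As one illustration of such repackaging, here is a minimal sketch that splits a simple locus like 5p13.1 into three numbers: chromosome, arm, and band. It is only an assumption about how one could do it, not necessarily the encoding used for this dataset. The class name `CytoBand`, the mapping of X and Y to 23 and 24, and the choice of 0 for the p arm are all hypothetical, and loci written as ranges are deliberately rejected.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CytoBand {
    // Chromosome (1-22, X or Y), then the arm (p or q), then the band, e.g. "17p13.1".
    private static final Pattern LOCUS =
            Pattern.compile("^(\\d{1,2}|X|Y)([pq])(\\d+(?:\\.\\d+)?)$");

    /** Returns {chromosome, arm (0 = p, 1 = q), band}; throws on anything else. */
    public static double[] parse(String locus) {
        Matcher m = LOCUS.matcher(locus.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised locus: " + locus);
        }
        String chr = m.group(1);
        // Assumption for illustration only: X becomes 23 and Y becomes 24.
        double chromosome = chr.equals("X") ? 23 : chr.equals("Y") ? 24 : Integer.parseInt(chr);
        double arm = m.group(2).equals("p") ? 0 : 1;
        double band = Double.parseDouble(m.group(3));
        return new double[] {chromosome, arm, band};
    }

    public static void main(String[] args) {
        for (String locus : new String[] {"5p13.1", "17p13.1", "20q11.2", "10q22.2"}) {
            System.out.println(locus + " -> " + Arrays.toString(parse(locus)));
        }
    }
}
```

Treating the band as a plain number imposes an ordering that may or may not mean anything to the model, which is one reason a categorical encoding can be the safer default.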
There are 16 fields on the gene data set. The eleventh field indicates the orientation of the gene. This is represented by a 1 or -1. The 1 and -1 correspond to this:
The direction in which the RNA is transcribed is in the 5′ to 3′ direction. But although a gene always has the orientation 5′ to 3′, it can be on one of two opposite strands denoted by + and -. This is what I will choose as my output label.
Now I have to look for the candidate input labels: those that stand a chance of having a meaningful correlation with the output label. The first six labels (GenAge ID, symbol, aliases, name, entrez gene id, uniprot) and the four just before the last (acc promoter, acc orf, acc cds, and references) can be neglected, since they are IDs telling us about naming conventions and nothing about the physical structure. That leaves 5 fields for consideration apart from the output label.
Of these 5, let’s inspect which columns don’t present their information in digits.
This is the first row:
Crowded, I know. But the 5 things we care about are on the indices 7, 8, 9, 10, and 16:
You will see there are several labels which are not digits: why, location, and orthologs all have values that are not digits. We need to transform them into digits in a meaningful way before passing them into the neural network. And they cannot be encoded as single binary digits (0’s and 1’s) because for each label, there are more than 2 possible values.
For example, looking at the data we see that the label why can have the values “mammal” or the value “cell, functional” or the value “mammal, model, cell”, along with several others.
And the label location can have the values appropriate for a gene locus: 17p13.1, or 20q11.2, or 10q22.2, and so on. If we had to specify just the chromosome for the gene in a human, we would already have 24 different possibilities (22 autosomes plus X and Y).
Since we have so many possible values for each label that we care about, this situation calls for one-hot encoding.
So I have set out to follow the conclusions of this procedure:
If the values are not digits → check whether they should be binary.
If they should be binary → encode in binary digits.
If they should not be binary → one-hot encode.
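The procedure above can be sketched as a small decision function. This is a hedged illustration in plain Java, not part of DataVec: the class `EncodingChooser`, its enum, and the rule "two or fewer distinct values means binary" are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class EncodingChooser {
    public enum Encoding { NUMERIC, BINARY, ONE_HOT }

    /** True if the value is already written in digits (optionally signed or decimal). */
    public static boolean isNumeric(String value) {
        return value.trim().matches("-?\\d+(\\.\\d+)?");
    }

    /** Applies the three-step rule to all observed values of one column. */
    public static Encoding choose(List<String> columnValues) {
        boolean allNumeric = true;
        for (String value : columnValues) {
            if (!isNumeric(value)) {
                allNumeric = false;
                break;
            }
        }
        if (allNumeric) {
            return Encoding.NUMERIC; // already digits, nothing to encode
        }
        // Assumption: a non-numeric column with at most two distinct values is binary.
        Set<String> distinct = new LinkedHashSet<>(columnValues);
        return distinct.size() <= 2 ? Encoding.BINARY : Encoding.ONE_HOT;
    }

    public static void main(String[] args) {
        System.out.println(choose(Arrays.asList("1", "-1", "1")));
        System.out.println(choose(Arrays.asList("mammal", "cell,functional", "mammal,model,cell")));
    }
}
```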
My ultimate goal here is to predict whether a gene is on the 5′→3′ DNA strand, a.k.a. the ‘sense’, ‘plus’ or ‘coding’ strand. This + strand has a sequence identical to that of the pre-messenger RNA (except for uracil (U) in RNA instead of thymine (T) in DNA); it is the coding strand, which is not itself transcribed. The alternative is that the gene is on the complementary strand that is transcribed by the RNA polymerase, known as the ‘antisense’, ‘minus’ or ‘non-coding’ strand.
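To make the relationship between the two strands concrete, here is a small sketch. The class `Strands` and the toy sequence are hypothetical. It shows that the RNA reads like the coding strand with U in place of T, and that the same RNA is obtained by reverse-complementing the template strand first.

```java
public class Strands {
    /** Returns the reverse complement of a DNA sequence written 5' to 3'. */
    public static String reverseComplement(String dna) {
        StringBuilder out = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            char base = Character.toUpperCase(dna.charAt(i));
            switch (base) {
                case 'A': out.append('T'); break;
                case 'T': out.append('A'); break;
                case 'C': out.append('G'); break;
                case 'G': out.append('C'); break;
                default: throw new IllegalArgumentException("Not a DNA base: " + base);
            }
        }
        return out.toString();
    }

    /** The RNA has the same sequence as the coding strand, with U instead of T. */
    public static String rnaFromCodingStrand(String codingStrand) {
        return codingStrand.toUpperCase().replace('T', 'U');
    }

    /** The polymerase reads the template strand, so complement it first. */
    public static String rnaFromTemplateStrand(String templateStrand) {
        return rnaFromCodingStrand(reverseComplement(templateStrand));
    }

    public static void main(String[] args) {
        String coding = "ATGGCC"; // toy sequence, not taken from the dataset
        String template = reverseComplement(coding);
        System.out.println("coding strand:   " + coding);
        System.out.println("template strand: " + template);
        System.out.println("RNA:             " + rnaFromTemplateStrand(template));
    }
}
```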
Knowing my ultimate goal, I must take care to make all the data relevant to the final prediction. So I must inspect with my own human eyes and intuitions what the uncleaned data contains.
For the why label/column, the possible values are:
mammal
“mammal,model,cell”
“mammal,cell”
“cell,functional”
human
“human,mammal,cell”
model
“model,functional”
“cell,downstream”
downstream
functional
putative
“mammal,functional,downstream”
“model,putative”
“model,cell”
“model,downstream”
“cell,upstream”
“functional,putative”
“mammal,putative”
upstream
“functional,downstream”
“upstream,putative”
“downstream,putative”
cell
“model,human_link”
“mammal,model”
human_link
“mammal,functional”
“functional,upstream”
“cell,putative”
“mammal,upstream,downstream”
“mammal,cell”
“mammal,human_link”
Each one of those represents a single value that is possible under the label why. We can choose to one-hot encode them or further engineer them into more sophisticated categories that split the column in pieces so that overlap of the variable is reduced.
I will one-hot encode them for now. So I assign each of these categorical values an integer index, starting at 0, and then translate that integer into a vector by placing a 1 at the respective index in an array of 0’s.
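A minimal sketch of that one-hot step in plain Java follows. The class `OneHot` and its method names are hypothetical, and only a handful of the why values are used to keep the vectors short.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OneHot {
    /** Gives each distinct category an integer index, in order of first appearance. */
    public static Map<String, Integer> buildIndex(List<String> categories) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String category : categories) {
            if (!index.containsKey(category)) {
                index.put(category, index.size());
            }
        }
        return index;
    }

    /** Returns an array of 0's with a single 1 at the category's index. */
    public static int[] encode(String value, Map<String, Integer> index) {
        Integer position = index.get(value);
        if (position == null) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        int[] vector = new int[index.size()];
        vector[position] = 1;
        return vector;
    }

    public static void main(String[] args) {
        // A few of the "why" values; the real column has many more.
        List<String> why = Arrays.asList("mammal", "mammal,model,cell", "mammal,cell", "cell,functional");
        Map<String, Integer> index = buildIndex(why);
        for (String value : why) {
            System.out.println(value + " -> " + Arrays.toString(encode(value, index)));
        }
    }
}
```

Because the index is built from distinct values, a value that appears twice in the list still gets a single slot in the vector.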
You can follow along by doing the following:
Download a 64-bit version of Java from here: Java SE Development Kit 8 Downloads
Now you must set JAVA_HOME.
If you have a Mac, go to terminal and run the following commands:
export JAVA_HOME=jdk-install-dir
export PATH=$JAVA_HOME/bin:$PATH
If you have a different system click here.
You also need an IDE such as IntelliJ.
Download either the permanently free Community or the free trial for Ultimate.
You need Maven.
For a Mac, go to terminal and
brew install maven
If you have a different system click here.
You also need git.
Go here if you don’t have it already.
If you have it already, then just update it with this
git clone git://git.kernel.org/pub/scm/git/git.git
Enter this into terminal
git clone https://github.com/deeplearning4j/dl4j-examples.git
cd dl4j-examples/
mvn clean install
Open IntelliJ and choose Import Project.
Select dl4j-examples.
Choose ‘Import project from external model’ and ensure that Maven is selected.
A simple machine learning algorithm cannot “learn” about information such as words and genes without proper translation. “All is number,” said Pythagoras. “All is number,” says the machine.
There are five fields on the dataset that we care about. Our output label, the one that can be a 0 or 1 and which we are learning to predict, is the orientation described by a 1 or -1 on index 11 in the original data.
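Turning the 1 or -1 in the data into a 0 or 1 label is a one-line mapping. The sketch below assumes that -1 becomes 0 and 1 stays 1. The source does not say which way round the mapping goes, so treat that choice, and the class name `OrientationLabel`, as hypothetical.

```java
public class OrientationLabel {
    /**
     * Maps the orientation field to a binary class label.
     * Assumption: 1 (plus strand) becomes 1 and -1 (minus strand) becomes 0.
     */
    public static int toBinaryLabel(int orientation) {
        if (orientation == 1) {
            return 1;
        }
        if (orientation == -1) {
            return 0;
        }
        throw new IllegalArgumentException("Orientation must be 1 or -1, got " + orientation);
    }

    public static void main(String[] args) {
        System.out.println(" 1 -> " + toBinaryLabel(1));
        System.out.println("-1 -> " + toBinaryLabel(-1));
    }
}
```

Rejecting any other value makes a malformed row fail loudly instead of silently becoming a training label.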
When building the schema, you use string for things that aren’t composed solely of numbers in the original data.
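// Needs: import org.datavec.api.transform.schema.Schema; import java.util.Arrays;
Schema schema = new Schema.Builder()
    // Identifier columns: kept in the schema so it matches the file, ignored later
    .addColumnInteger("GenAge ID")
    .addColumnString("symbol")
    .addColumnString("aliases")
    .addColumnString("name")
    .addColumnInteger("entrez gene id")
    .addColumnString("uniprot")
    // Categorical column: the list must name every possible "why" value; only the first is shown here
    .addColumnCategorical("why", Arrays.asList("mammal"))
    .addColumnString("band")
    .addColumnInteger("location start")
    .addColumnInteger("location end")
    // Output label: 1 or -1
    .addColumnInteger("orientation")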
…
Unfortunately this was all signaling, and real progress requires that we awaken from the slumber of the misaligned need to impress those around us. The competitive spirit of mankind at large must be funneled into the establishment of rejuvenation therapies that roughly follow the outline sketched by Strategies for Engineered Negligible Senescence, in order to rejuvenate our tissues and cells such that a safety net of biological youth is unlocked and an evil is slain.
How can people be true when their bodies rot? How can they read with comfort and grace when entrance to the library requires signing a contract to burn with all the books?
How can they love when those around them will be destroyed?
Then sprang the happier day from underground;
And revel and song, made merry over Death,
So large mirth lived and Gareth won the quest.