The data: 3190 sequences of 60 DNA base pairs (A,C,T,G), each sequence labeled as an intron to exon segment, an exon to intron segment, or neither.
The source: UC Irvine Machine Learning Repository, care of Delve Datasets.
Introns and exons: When DNA is converted into protein, some seemingly extraneous chunks of DNA are snipped out of the sequence before translation at the ribosome. They’re called “introns”. It seems to not be totally clear how such a phenomenon would arise, although there are a few very interesting things to read about them on Wikipedia. For instance, it is thought that introns are involved in the formation of non-coding but functional RNA. Also, some speculate that eukaryotic life began with an “intron invasion”. I’m not sure what that means but I wish someone had been there with a camera.
The task: Classifying new sequences of DNA as intron to exon segments (“IE”), exon to intron segments (“EI”), or neither (“N”), based solely on their sequence in a 60 base pair window. This is a classification problem, i.e. the question is of the form “What class does this object belong to?”, where here the candidate classes are “IE”, “EI”, and “N”, and the object is a 60 base pair sequence of DNA. At the University of Wisconsin, groups have applied a handful of tools to this problem with this very data, each time training their classifier with 1000 labeled DNA snippets.
The tool: A decision tree is a classifier that is typically represented as a tree graph, oriented so as to resemble a family tree or phylogenetic tree or what have you — see the figures below. Each node represents a question about the object to be classified (the DNA sequence) to which there is a finite discrete set of possible answers (e.g. “Is the marble blue, red, or black?” or, in our case, maybe “What is the first base in the sequence: A, G, C, or T?”). Each arrow emanating from a node corresponds to one of the possible answers to that node’s question, and that arrow ends at another node. That node either represents another question or it is a class label.
Given a decision tree, and an object to classify, we start at the head node, answer the corresponding question, following the corresponding arrow to the next node, and repeat until we reach a class label. We conclude that the object is of that class.
How do we construct a decision tree in the first place? That is, what questions should the nodes ask about the object to be classified? There are several ways. One way is, precisely, to choose the attribute of the object and the partitioning of the values of that attribute that has the highest entropy for the test data. Roughly speaking, this means choosing a question that tells you the most about the object to be classified, averaged over the possible answers to the question, just like the way you would strategically choose a question to ask in a game of 20 questions. (That game itself is an example of a tree, this one having exactly two arrows (yes/no) emanating from any non-terminal node.)
The main advantages of decision tree classifiers are low computational cost, good handling of irrelevant features, and gosh those are pretty graphs, I think. On the other hand, decision trees exhibit substantial overfitting, meaning that as you feed the tree-building algorithm more training data, it eventually begins to decrease training error at the expense of test error. The horror.
Applying the tool: As I have done before, I used as a starting point some code provided with Harrington’s book Machine Learning in Action. Here’s an example of a decision tree classifier built from 15 labeled snippets of DNA. (These figures were produced using pydot, which interfaces Python with the graph visualization software Graphviz. I used the same for a different kind of graph in my post about mapping text documents.)
Here’s how to read this tree, from top to bottom. Start at the head node. The node is labeled ’32’, short for ‘What base is at position 32 in the 60 nucleotide long segment?’. Then you follow the arrow corresponding to the answer to that question: if the 32nd base is ‘C’ we would follow the arrow labeled ‘C’, to the node labeled ‘4’, which asks us about position 4 of the segment, and so on. Eventually we will reach a stopping point, which is called a leaf node. That node is labeled with the class that this tree classifier assigns to the original segment. Boom.
Results: Here’s the tree that was built from 1000 labeled snippets of DNA. Click on the faint image to see it up close.
This classifier achieves an error rate of 14%, which is comparable to the error rates achieved by groups using other methods as tabulated on Delve. That’s encouraging, considering how speedy this procedure is. From start to finish the data processing, training, testing, and plotting takes a little under two seconds. Here’s the error rate as a function of the size of the training set: