Unix tip: count uniques without sorting

It occurred to me that you don’t need to sort a file to count the occurrences of unique lines. And that awk has associative arrays.

By way of example, I have a file with some IP addresses in it. Each line has a single IP address and nothing else.

$ wc -l some_ips.csv

Here is how I previously would have counted how many times each IP occurs:

$ time cat some_ips.csv | sort | uniq -c > /dev/null

real 0m0.751s
user 0m0.750s
sys 0m0.006s

Here is how I would do it now:

$ alias just_count="awk '{c[\$0]++}END{for(x in c){print c[x], x}}'"
$ time cat some_ips.csv | just_count > /dev/null

real 0m0.110s
user 0m0.092s
sys 0m0.033s</code>

Note how much less time it takes the second way.

The order of the output may be different, but if you “sort -n” the outputs of the two commands you will see that they are the same.


irssi in a remote screen session

The advantage of running irssi (command-line IRC client) in a persistent screen session on a box you can SSH to is twofold: 1) You get logs of all channel and PM activity, 24/7 except Christmas. (Just kidding, Christmas too!) 2) You can tunnel in from any computer with an internet connection that may not have an IRC client installed.

On OSX there’s Growl, which issues desktop notifications. I find it sort of useless to be on IRC and have no audible notification of incoming messages and mentions. There’s a page with a great explanation for how to set up notification over the tunnel to the box with the irssi session. Here’s the page I’m referring to.

Something that is left as a hanging question is how to automatically start a requisite shell script on startup without it leaving an annoying otherwise useless terminal window open. The comments section on that page seems to be closed or broken, so I’m explaining my fix to this problem here.

Just to make this self-contained and in case the other page goes the way of its comments section, following are simply the steps outlined in that page I mentioned. I assume you have a remote VM that runs persistently and that irssi is installed on. Personally, for this purpose I use a $5/mo. “droplet” through DigitalOcean.

  1. Log in to the remote VM and add the fnotify script to your ~/.irssi/scripts/ directory. Or put it in ~/.irssi/scripts/autorun to have it load automatically when you start irssi.
    (in a remote terminal) $ wget http://bit.ly/fnotify-raw -O ~/.irssi/scripts/autorun
  2. To get fnotify.pl loaded in a pre-existing irssi session, enter this command in irssi:
    /script load autorun/fnotify.pl
  3. Confirm it’s working by PM-ing yourself while tailing the fnotify file:
    (in one remote terminal) $ tail -f ~/.irssi/fnotify
    (in irssi in another remote terminal) /q <your username>
    (also in irssi) test foo bar
  4. On your local machine, install Growl and GrowlNotify. Test the latter:
    (in a local terminal) $ growlnotify -m 'whatever'
  5. Subbing in your username and hostname, save the following in a script as, say, ~/irrsi_growler (modified slightly from the original):
    (ssh ssh_username@your_server.com -o PermitLocalCommand=no \
    ": > .irssi/fnotify ; tail -f .irssi/fnotify " | \ while read heading message; do \
    growlnotify -t "${heading}" -m "${message}"; \
    \ done)&

Now, for the start-on-login issue, my workaround is to name the remote screen session something consistent (e.g. “irssi”) which can be done _in_ the remote screen session by ctrl-a followed by :sessionname <sessionname> and then open irssi (through a tunnel) with a bash function defined in my .bashrc:

irssi_screen() {
[[ -z $(ps aux | grep -v grep | grep irssi_growler) ]] && irssi_growler
ssh -t <username>@<hostname> screen -r <screenname>

It only starts irssi_growler if it’s not already running.

Make sure you make irssi_growler executable:

(local terminal) $ chmod +x ~/irssi_growler

A mystery: For some reason this makes _2_ processes for irssi_growler. Dunno why, but the point is that the number of irssi_growler processes does not grow as you close and reopen the connection and session.

I’m confident that there are simple ways to improve upon my approach to the start-on-login issue that I’m unaware of if not a better solution entirely. Happy for any input.

Intron/exon classification by decision tree

The data: 3190 sequences of 60 DNA base pairs (A,C,T,G), each sequence labeled as an intron to exon segment, an exon to intron segment, or neither.

The source: UC Irvine Machine Learning Repository, care of Delve Datasets.

Introns and exons: When DNA is converted into protein, some seemingly extraneous chunks of DNA are snipped out of the sequence before translation at the ribosome. They’re called “introns”. It seems to not be totally clear how such a phenomenon would arise, although there are a few very interesting things to read about them on Wikipedia. For instance, it is thought that introns are involved in the formation of non-coding but functional RNA. Also, some speculate that eukaryotic life began with an “intron invasion”. I’m not sure what that means but I wish someone had been there with a camera.

The task: Classifying new sequences of DNA as intron to exon segments (“IE”), exon to intron segments (“EI”), or neither (“N”), based solely on their sequence in a 60 base pair window. This is a classification problem, i.e. the question is of the form “What class does this object belong to?”, where here the candidate classes are “IE”, “EI”, and “N”, and the object is a 60 base pair sequence of DNA. At the University of Wisconsin, groups have applied a handful of tools to this problem with this very data, each time training their classifier with 1000 labeled DNA snippets.

The tool: A decision tree is a classifier that is typically represented as a tree graph, oriented so as to resemble a family tree or phylogenetic tree or what have you — see the figures below. Each node represents a question about the object to be classified (the DNA sequence) to which there is a finite discrete set of possible answers (e.g. “Is the marble blue, red, or black?” or, in our case, maybe “What is the first base in the sequence: A, G, C, or T?”). Each arrow emanating from a node corresponds to one of the possible answers to that node’s question, and that arrow ends at another node. That node either represents another question or it is a class label.

Given a decision tree, and an object to classify, we start at the head node, answer the corresponding question, following the corresponding arrow to the next node, and repeat until we reach a class label. We conclude that the object is of that class.

How do we construct a decision tree in the first place? That is, what questions should the nodes ask about the object to be classified? There are several ways. One way is, precisely, to choose the attribute of the object and the partitioning of the values of that attribute that has the highest entropy for the test data. Roughly speaking, this means choosing a question that tells you the most about the object to be classified, averaged over the possible answers to the question, just like the way you would strategically choose a question to ask in a game of 20 questions. (That game itself is an example of a tree, this one having exactly two arrows (yes/no) emanating from any non-terminal node.)

The main advantages of decision tree classifiers are low computational cost, good handling of irrelevant features, and gosh those are pretty graphs, I think. On the other hand, decision trees exhibit substantial overfitting, meaning that as you feed the tree-building algorithm more training data, it eventually begins to decrease training error at the expense of test error. The horror.

Applying the tool: As I have done before, I used as a starting point some code provided with Harrington’s book Machine Learning in Action. Here’s an example of a decision tree classifier built from 15 labeled snippets of DNA. (These figures were produced using pydot, which interfaces Python with the graph visualization software Graphviz. I used the same for a different kind of graph in my post about mapping text documents.)


Here’s how to read this tree, from top to bottom. Start at the head node. The node is labeled ’32’, short for ‘What base is at position 32 in the 60 nucleotide long segment?’. Then you follow the arrow corresponding to the answer to that question: if the 32nd base is ‘C’ we would follow the arrow labeled ‘C’, to the node labeled ‘4’, which asks us about position 4 of the segment, and so on. Eventually we will reach a stopping point, which is called a leaf node. That node is labeled with the class that this tree classifier assigns to the original segment. Boom.

Results: Here’s the tree that was built from 1000 labeled snippets of DNA. Click on the faint image to see it up close.


This classifier achieves an error rate of 14%, which is comparable to the error rates achieved by groups using other methods as tabulated on Delve. That’s encouraging, considering how speedy this procedure is. From start to finish the data processing, training, testing, and plotting takes a little under two seconds. Here’s the error rate as a function of the size of the training set: