Unix tip: count uniques without sorting

It occurred to me that you don’t need to sort a file to count the occurrences of unique lines, because awk has associative arrays.

By way of example, I have a file with some IP addresses in it. Each line has a single IP address and nothing else.


$ wc -l some_ips.csv
222049

Here is how I previously would have counted how many times each IP occurs:


$ time cat some_ips.csv | sort | uniq -c > /dev/null

real 0m0.751s
user 0m0.750s
sys 0m0.006s

Here is how I would do it now:


$ alias just_count="awk '{c[\$0]++}END{for(x in c){print c[x], x}}'"
$ time cat some_ips.csv | just_count > /dev/null

real 0m0.110s
user 0m0.092s
sys 0m0.033s

Note how much less time the second way takes: about 0.11 seconds of wall-clock time versus 0.75, roughly seven times faster on this file.
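For anyone who does not read awk fluently, here is a sketch of the same program written out as a shell function with comments. The name count_lines and the three sample addresses are made up for illustration; the awk logic is meant to match the alias above.

```shell
# count_lines is a hypothetical name; the awk program mirrors the alias above.
count_lines() {
  awk '
    # For every input line, use the whole line ($0) as the array key
    # and increment its counter. A key that was never seen starts at 0.
    { count[$0]++ }
    # After the last line, print each count followed by the line it belongs to.
    # The order in which "for (key in count)" visits keys is unspecified.
    END { for (key in count) print count[key], key }
  ' "$@"
}

# Made-up sample input, sorted afterwards only to get a stable order.
printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.1 | count_lines | sort -n
# prints:
# 1 10.0.0.2
# 2 10.0.0.1
```

The trade-off, as far as I can tell, is memory: awk keeps one array entry per distinct line, so this works best when the number of unique lines fits comfortably in RAM.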

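The order of the output may be different, because awk does not guarantee any particular order when looping over an array, and “uniq -c” may pad its counts with leading spaces. But if you strip that padding and “sort -n” the outputs of the two commands, you will see that they are the same.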
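One way to check that claim is sketched below. It uses a tiny made-up sample file rather than some_ips.csv, and it assumes the only formatting difference is the leading padding that uniq -c may add, which sed strips before comparing.

```shell
# Build a small sample file (made-up addresses, not the real data).
tmp=$(mktemp -d)
printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.1 10.0.0.3 10.0.0.1 10.0.0.2 > "$tmp/ips.txt"

# Old way: sort, count, strip the leading padding from uniq -c, order by count.
sort "$tmp/ips.txt" | uniq -c | sed 's/^ *//' | sort -n > "$tmp/old.txt"

# New way: count with awk, then order by count.
awk '{c[$0]++} END {for (x in c) print c[x], x}' "$tmp/ips.txt" | sort -n > "$tmp/new.txt"

# cmp -s is silent and only sets the exit status.
if cmp -s "$tmp/old.txt" "$tmp/new.txt"; then
  echo same
else
  echo different
fi
# prints: same
```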
