Software used in conjunction with AWK

There are many software tools that a programmer can use in conjunction with AWK to process data the easy way. The truth is that almost any program designed to work as a filter can be used in conjunction with AWK.
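For example, a pipeline like the following (a minimal sketch; the access.log file name is hypothetical) places AWK between standard filters to report how often each value of the first column occurs:

awk '{ print $1 }' access.log | sort | uniq -c | sort -rn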

Processing firewall data

Let's say we want to process firewall output consisting of two-column, tab-separated data, namely IP addresses and visited domains:

124.171.91.158	instagram.com
18.78.75.157	pinterest.com
12.47.109.31	themeforest.net
163.36.205.18	dyndns.org
117.40.14.145	deviantart.com
67.44.108.141	mashable.com
83.186.80.24	ocn.ne.jp
128.221.179.47	clickbank.net
243.222.47.76	hud.gov
110.2.88.207	unblog.fr
...

We want to produce some useful reports based on this kind of data, such as:

  • Print 10 most visited domains
  • Print domains visited by more than 100 different IPs
  • Print each domain followed by IP lists with visit counts for each IP
    Print 10 most visited domains

    An easy way to print the 10 most visited domains is to count the visits for each domain, sort the counts, and keep the 10 top lines. Save the following AWK script to the iptopvisit.awk file:

    {
    	count[$2]++
    }
    
    END {
    	for (domain in count)
    		print domain, count[domain]
    }
    

    Assuming that the firewall data are stored in the fdata file, run the report with:

    awk -f iptopvisit.awk fdata | sort -k2n | tail -10
    

    AWK prints the domain visit counts unordered, so we must sort AWK's output numerically on the count field and print the last 10 lines of the sorted data. In this case we used AWK in conjunction with the standard UNIX/Linux tools sort and tail.
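    Equivalently, we may sort in reverse numeric order and keep the first 10 lines with head:

    awk -f iptopvisit.awk fdata | sort -k2nr | head -10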

    Print domains visited by more than 100 different IPs

    To carry out this report, we sort the firewall data by domain, then pass the sorted data to AWK and continue filtering with some other tools. Save the following AWK script to the popular.awk file:

    $2 != domain {
    	domain_stats()
    	domain = $2
    }
    
    {
    	visit[$1]++
    }
    
    END {
    	domain_stats()
    }
    
    function domain_stats(		ip, count) {
    	if (!domain)
    		return

    	for (ip in visit)
    		count++

    	delete visit

    	if (count > min_count)
    		print domain, count
    }
    

    Run the report with:

    sort -k2 fdata | awk -v min_count=100 -f popular.awk | sort -k2n -k1
    

    Here AWK is used in conjunction with sort to produce the report.
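    The same report can also be produced without sorting the data at all, at the cost of keeping every distinct domain/IP pair in memory. The following is a minimal sketch of that approach (the popular2.awk file name is just an example):

    # count each domain/IP pair only once
    !seen[$2, $1]++ {
    	count[$2]++
    }

    END {
    	for (domain in count)
    		if (count[domain] > min_count)
    			print domain, count[domain]
    }

    awk -v min_count=100 -f popular2.awk fdata | sort -k2n -k1

    The sorted variant above is gentler on memory, since it only keeps one domain's IPs at a time.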

    Print domains followed by IP visit counts

    To carry out this report, we sort the firewall data by domain and IP, then pass the sorted data to AWK. Save the following AWK script to the domip.awk file:

    BEGIN {
    	OFS = "\t"
    }
    
    # When a new domain arises, print the last IP count for the previous
    # domain (if any) and set up the environment for the new domain.
    
    $2 != domain {
    	# print previous domain's last IP count
    	print_ip()

    	# set and print current domain
    	print (domain = $2)

    	# set current IP to the first IP for the new domain and reset
    	# its counter; the $1 != ip rule below will not fire for this
    	# line, since ip has just been set to $1
    	ip = $1
    	ipcount = 0
    }
    
    # When a new IP arises, print the previous IP count and reset the IP
    # and its counter.
    
    $1 != ip {
    	# print the previous IP count (if any)
    	print_ip()
    
    	# reset current IP and counter.
    	ip = $1
    	ipcount = 0
    }
    
    # For lines that reach this point, just increase the IP counter.
    
    {
    	ipcount++
    }
    
    END {
    	print_ip()
    }
    
    function print_ip() {
    	if (ip)
    		print ip, ipcount
    }
    

    Run the report with:

    sort -k2 -k1 fdata | awk -f domip.awk
    

    Once again AWK is used in conjunction with sort to produce the report. The output of the above command may look like this:

    123-reg.co.uk
    0.81.28.192	1
    10.49.133.83	1
    105.82.239.196	1
    10.8.185.215	1
    ...
    52.195.143.187	1
    7.35.86.225	1
    74.151.123.225	1
    zimbio.com
    123.46.248.201	2
    149.143.34.178	1
    149.17.174.81	1
    197.27.231.148	1
    203.208.125.147	1
    229.177.72.93	1
    229.68.72.81	1
    5.116.228.64	1
    64.150.110.10	1
    66.49.171.141	1
    

    We can further process the above output by adding another AWK component to the pipeline. Save the following script to the domipcut file:

    # Domain lines contain just the domain name, so keep it for
    # future use.
    
    NF == 1 {
    	domain = $1
    	next
    }
    
    # Whenever the visit count is greater than or equal to the given
    # minimum, print the line.
    
    $2 >= mincount {
    	# If the current domain has not been printed yet, print it
    	# now and wipe it out.
    
    	if (domain) {
    		print domain
    		domain = ""
    	}
    
    	print $0
    }
    

    To print domains visited more than twice (that is, 3 times or more) from the same IP:

    sort -k2 -k1 fdata | awk -f domip.awk | awk -v mincount=3 -f domipcut
    

    Possible output of the above pipeline may look like:

    ehow.com
    0.177.120.131	3
    goodreads.com
    0.124.195.146	3
    home.pl
    134.215.153.48	3
    php.net
    110.236.220.207	3
    sciencedirect.com
    127.240.17.250	3
    shinystat.com
    115.240.225.114	3
    twitter.com
    10.166.203.66	3
    wikipedia.org
    102.229.253.238	3
    

    It becomes clear that only a few domains were visited more than twice from the same IP; it is also clear that no domain was visited more than 3 times from the same IP. All of the above was achieved with fewer than 65 lines of straightforward, readable, elegant AWK code!
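    These figures are easy to cross-check with the standard tools alone: since every firewall line is an IP/domain pair, counting duplicate lines counts repeated visits. A quick sketch of such a cross-check:

    sort fdata | uniq -c | awk '$1 > 2'

    Here uniq -c prefixes each distinct line with its number of occurrences and AWK keeps only the lines that appear more than twice.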

    ssconvert - spreadsheet file format converter

    ssconvert is part of Gnumeric, the GNU spreadsheet program. ssconvert converts spreadsheet data files from one format to another, e.g. from xls to CSV, or from xlsx to ods, etc. The program can be used as a filter, which means it can read data from standard input (fd://0) and write output to standard output (fd://1).
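    For example, assuming ssconvert is installed, a tab-separated stream fed to standard input can be turned into a spreadsheet file on the spot (the out.ods file name is arbitrary):

    printf 'jdoe\tJohn Doe\tjdoe@example.com\n' | \
    	ssconvert --import-type=Gnumeric_stf:stf_csvtab fd://0 out.ods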

    In order to demonstrate ssconvert usage, we will produce random data using PPL library functions:

    @include "ppl.awk"
    
    BEGIN {
    	OFS = "\t"
    
    	# count is expected as a command-line variable; adding 0
    	# forces a numeric value even when count has not been set
    	for (count += 0; count > 0; count--)
    		print ppl_login(), ppl_name(), ppl_email()
    }
    

    If the above AWK script is saved to rndgen.awk, we can easily construct a LibreOffice Calc file with three columns, namely login, full name and email, just by piping the AWK output to ssconvert:

    awk -v count=1000 -f rndgen.awk | ssconvert \
    	--import-type=Gnumeric_stf:stf_csvtab fd://0 user1000.ods
    

    The above command creates the user1000.ods file in LibreOffice Calc format. Conversely, we are able to process spreadsheet data of any known format with AWK, just by converting the data to ASCII text and passing the output to AWK. Thus, given the LibreOffice Calc file user1000.ods, the following command prints just those lines with a "us" email TLD:

    ssconvert --export-type=Gnumeric_stf:stf_assistant -O 'separator="^" quote=' \
    	user1000.ods fd://1 | awk -F "^" -v tld=us '$3 ~ "\\." tld "$"'