Software used in conjunction with AWK
There are many software tools that a programmer can use in conjunction with AWK to process data easily. In fact, almost any program designed to be used as a filter can be combined with AWK.
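As a quick illustrative example (unrelated to the data processed below), AWK can pick the login shell field out of /etc/passwd and hand the result to the standard sort and uniq filters, reporting how many accounts use each shell:

awk -F: '{ print $7 }' /etc/passwd | sort | uniq -c | sort -rn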
Processing firewall data
Let's say we want to process firewall output consisting of two-column, tab-separated data, namely IP addresses and visited domains:
124.171.91.158	instagram.com
18.78.75.157	pinterest.com
12.47.109.31	themeforest.net
163.36.205.18	dyndns.org
117.40.14.145	deviantart.com
67.44.108.141	mashable.com
83.186.80.24	ocn.ne.jp
128.221.179.47	clickbank.net
243.222.47.76	hud.gov
110.2.88.207	unblog.fr
...
We want to produce some useful reports based on that kind of data. A few such reports follow.
Print 10 most visited domains
An easy way to print the 10 most visited domains is to count the visits for each domain, sort the per-domain counts, and extract the top 10 lines.
Save the following AWK script to the iptopvisit.awk file:
{ count[$2]++ }

END {
    for (domain in count)
        print domain, count[domain]
}
Assuming that the firewall data are stored in the fdata file, run the report with:
awk -f iptopvisit.awk fdata | sort -k2n | tail -10
AWK prints the domain visit counts in no particular order, so we sort AWK's output numerically on the count field and print the last 10 lines of the sorted data. In this case AWK is used in conjunction with the standard UNIX/Linux tools sort and tail.
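For illustration, an equivalent pipeline could sort in reverse numeric order and keep the first 10 lines with head instead:

awk -f iptopvisit.awk fdata | sort -k2nr | head -10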
Print domains visited by more than 100 different IPs
To produce this report, we sort the firewall data by domain, pass the sorted data to AWK, and continue filtering with some other tools. Save the following AWK script to the popular.awk file:
# On a domain change, report statistics for the previous domain.
$2 != domain {
    domain_stats()
    domain = $2
}

# Count visits per IP for the current domain.
{ visit[$1]++ }

# Report statistics for the last domain.
END { domain_stats() }

# Count the distinct IPs collected for the current domain and print the
# domain whenever the count exceeds min_count.
function domain_stats(    ip, count)
{
    if (!domain)
        return
    for (ip in visit)
        count++
    delete visit
    if (count > min_count)
        print domain, count
}
sort -k2 fdata | awk -v min_count=100 -f popular.awk | sort -k2n -k1
Here AWK is used in conjunction with sort to produce the report.
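As an illustrative alternative (not the approach used above), the same report can be produced without pre-sorting the input, at the cost of keeping every distinct (domain, IP) pair in memory. The sketch below assumes a hypothetical popular-nosort.awk file:

# Count each (domain, IP) pair only once, i.e. count distinct IPs per domain.
!seen[$2, $1]++ { count[$2]++ }

END {
    for (domain in count)
        if (count[domain] > min_count)
            print domain, count[domain]
}

It would be run as awk -v min_count=100 -f popular-nosort.awk fdata, followed by the same final sort if ordered output is wanted.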
Print domains followed by IP visit counts
To produce this report, we sort the firewall data by domain and IP, then pass the sorted data to AWK. Save the following AWK script to the domip.awk file:
BEGIN { OFS = "\t" }

# When a new domain arises, print the last IP count for the previous domain
# (if it exists) and set up the environment for the new domain.
$2 != domain {
    # print the previous domain's last IP count
    print_ip()
    # set and print the current domain
    print (domain = $2)
    # set the current IP to the first IP of the new domain and reset the counter
    ip = $1
    ipcount = 0
}

# When a new IP arises, print the previous IP count and reset the IP and counter.
$1 != ip {
    # print the previous IP count (if it exists)
    print_ip()
    # reset the current IP and counter
    ip = $1
    ipcount = 0
}

# For lines that reach this point, just increase the IP counter.
{ ipcount++ }

END { print_ip() }

function print_ip()
{
    if (ip)
        print ip, ipcount
}
sort -k2 -k1 fdata | awk -f domip.awk
Once again AWK is used in conjunction with sort to produce the report. The output of the above command may look like this:
123-reg.co.uk
0.81.28.192	1
10.49.133.83	1
105.82.239.196	1
10.8.185.215	1
...
52.195.143.187	1
7.35.86.225	1
74.151.123.225	1
zimbio.com
123.46.248.201	2
149.143.34.178	1
149.17.174.81	1
197.27.231.148	1
203.208.125.147	1
229.177.72.93	1
229.68.72.81	1
5.116.228.64	1
64.150.110.10	1
66.49.171.141	1
We can further process the above output by adding another AWK component. Save the following AWK script to the domipcut file:
# Domain lines contain just the domain name, so keep it for
# future use.
NF == 1 {
    domain = $1
    next
}

# Whenever the visit count is greater than or equal to the given minimum,
# print the line.
$2 >= mincount {
    # If the current domain has not been printed yet, print it now
    # and wipe it out.
    if (domain) {
        print domain
        domain = ""
    }
    print $0
}
To print domains visited at least twice from the same IP:
sort -k2 -k1 fdata | awk -f domip.awk | awk -v mincount=2 -f domipcut
The output of the above pipeline may look like this:
ehow.com
0.177.120.131	3
goodreads.com
0.124.195.146	3
home.pl
134.215.153.48	3
php.net
110.236.220.207	3
sciencedirect.com
127.240.17.250	3
shinystat.com
115.240.225.114	3
twitter.com
10.166.203.66	3
wikipedia.org
102.229.253.238	3
It becomes clear that there are just a few domains visited more than twice from the same IP; it is also clear that no domain is visited more than 3 times from the same IP. All of the above was achieved with fewer than 65 lines of straightforward, readable, elegant AWK code!
ssconvert - spreadsheet file format converter
ssconvert is part of Gnumeric, the GNU spreadsheet program. ssconvert converts spreadsheet data files from one format to another, e.g. from XLS to CSV, or from XLSX to ODS.
The program can be used as a filter, which means it can read data from standard input (fd://0) or write output to standard output (fd://1).
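When reading from and writing to regular files, ssconvert can usually guess both formats from the file name extensions, so a plain file-to-file conversion needs no options at all (report.xlsx and report.csv are hypothetical file names):

ssconvert report.xlsx report.csv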
In order to demonstrate ssconvert usage, we will produce random data using PPL library functions:
@include "ppl.awk" BEGIN { OFS = "\t" for (count += 0; count > 0; count--) print ppl_login(), ppl_name(), ppl_email() }
If the above AWK script is saved to rndgen.awk, we can easily construct a LibreOffice Calc file with three columns, namely login, full name and email, just by piping the AWK output to ssconvert:
awk -v count=1000 -f rndgen.awk | ssconvert \
    --import-type=Gnumeric_stf:stf_csvtab fd://0 user1000.ods
The above command creates the file user1000.ods
in LibreOffice Calc format.
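To inspect the generated data before converting it, the script can also be run on its own with a small count, printing tab-separated text to the terminal (this assumes the PPL library is installed where @include can find it):

awk -v count=5 -f rndgen.awk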
Conversely, we are able to process spreadsheet data in any supported format using AWK, just by converting the data to plain text and passing the output to AWK.
Thus, given the LibreOffice Calc file user1000.ods, the following command will print just those lines whose email address has the "us" TLD:
ssconvert --export-type=Gnumeric_stf:stf_assistant -O 'separator="^" quote=' \
    user1000.ods fd://1 | awk -F "^" -v tld=us '$3 ~ "\\." tld "$"'
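The trailing AWK component can be adapted further; for example, an illustrative variant could count the matching rows instead of printing them:

ssconvert --export-type=Gnumeric_stf:stf_assistant -O 'separator="^" quote=' \
    user1000.ods fd://1 |
awk -F "^" -v tld=us '$3 ~ "\\." tld "$" { hits++ } END { print hits + 0 }'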