AWK in brief
Introduction
AWK is a multi-purpose software tool mainly used as a filter, that is, a program that reads input and produces output. Input is read in lines and every line is split into fields. After reading and splitting each line into fields, AWK can take some action, e.g. print, count etc., according to the patterns matched by the line just read. The following excerpt (pages 2-3) from the famous book "The AWK Programming Language" shows the elegance of AWK:
In other words, every AWK program is a series of pattern/action items. AWK reads input one line at a time, checks each line against the specified patterns and takes the corresponding actions for the matching lines.
The default action is "print the line", while a missing pattern "matches" every input line, meaning that an action without a pattern will be taken for every single line of input. On the other hand, a pattern without an action causes matching lines to be printed, while non-matching lines will be wiped out.
There are two extra "patterns", namely BEGIN and END, for actions to be taken before any input has been read and after all input has been read.
That's all!
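As a minimal sketch (not from the book), here is a hypothetical three-line AWK program that uses all of the above parts, assuming input lines holding a name and a numeric score:
BEGIN { print "scores above 50:" }   # runs before any input is read
$2 > 50                              # pattern without action: print the matching line
END { print NR, "lines read" }       # runs after all input has been read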
Here is a simple example taken from an interview with Brian Kernighan, one of the creators of AWK. We have volcano eruption data in a file. Each line consists of three columns separated by tab characters. The first column is the volcano name, the second column contains the date of the eruption, while the third column is the magnitude of the eruption on a scale from 1 to 6.
Given that specific file of volcanic eruption data, we can do some interesting things with AWK just by using one-liner AWK programs. Let's say the data file is named vedata. The following will print eruptions of magnitude greater than 4:
awk '$3 > 4' vedata
In this case we used AWK as a filter on eruption magnitude. Our program consists of just one pattern, that is, a check for the magnitude being greater than 4. The action is missing, which means the matching lines are printed (the default action).
Now let's print all eruptions of the volcano "Krakatoa", filtering the data by volcano name:
awk '/Krakatoa/' vedata
or more precisely:
awk '$1 == "Krakatoa"' vedata
Assuming that eruption dates are formatted as YYYYMMDD, e.g. 18830827 is August 27, 1883, we can print all eruptions of the year 1996:
awk '($2 >= 19960000) && ($2 < 19970000)' vedata
or
awk '($2 > 19959999) && ($2 < 19970000)' vedata
We can produce some amazing results by writing just a few lines of code. Let's say we want to print total eruption counts by magnitude:
awk '{ count[$3]++ } END { for (i in count) { print i, count[i] } }' vedata | sort -n
Now let's print eruption counts by year:
awk '{ count[int($2 / 10000)]++ } END { for (i in count) { print i, count[i] } }' vedata | sort -n
History
AWK is a program developed at AT&T Bell Labs in the late 1970s as a general purpose tool to be included in the UNIX toolset. The name AWK is derived from the names of its creators: Al Aho, Peter Weinberger and Brian Kernighan.
From the first days after its birth, AWK proved to be an invaluable tool and has been used in more and more shell scripts and other UNIX tools.
After the creation of the GNU project, AWK was significantly revised and expanded, eventually resulting in gawk, the GNU AWK implementation, written by Paul Rubin, Jay Fenlason and Richard Stallman.
GNU AWK has been maintained by Arnold Robbins since 1994.
Over the years, other versions of awk have been implemented as well.
Brian Kernighan himself developed nawk, which has been dubbed the one-true-awk; nawk has been used in many operating systems, e.g. FreeBSD, NetBSD, OpenBSD, macOS and illumos.
Another well known AWK implementation is mawk, written by Mike Brennan and now maintained by Thomas Dickey.
The original awk is no longer in use because of gawk's increased functionality. Most modern operating systems offer plain awk as another name (a link) for gawk.
On my current Linux system there is a symbolic link from /usr/bin/awk to /etc/alternatives/awk, which in turn is linked to /usr/bin/gawk.
However, every piece of software must constantly evolve to reflect modern trends in software and hardware development, and AWK is no exception. Over the years there have been many aspects for the awk/gawk maintainers to handle, e.g. internationalization, localization, wide characters, the need for extensions, C hooks, two-way communication etc. The latest improvements of gawk concern dynamic extensions, but there are more improvements to be made in the future.
Basics
To use AWK effectively, one must know the core loop of AWK processing and some basic concepts affecting AWK's view of lines and columns. There are a few handy command line options which may prove indispensable to know. Basic knowledge of regular expressions is also needed to use pattern matching the right way. It's also good to know how to redirect AWK's standard output to files or pipes; you can also redirect standard input or standard error. All of the above form a good background for starting to use AWK effectively, but you can also make beneficial use of AWK with much less, as long as you know what you are doing.
One thing that must be clear before using AWK is the execution road map of every AWK program:
- AWK executes some initialization code given in the BEGIN section (optional).
- Then follows the so-called core loop: AWK reads input one line at a time and checks each line against the given patterns (optional). By the word line we usually mean one line of input, but you can change that by setting the RS variable to any single character or regular expression; e.g. if RS is set to "\n@\n", then each line may contain more than one line of input, which means that anything between lines consisting of just a single "@" character is considered a record.
- If a line (record) matches a pattern, then the corresponding action is taken. A next command in a taken action causes all remaining pattern/action entities to be skipped for the line at hand, and the next input line is read.
- The same process is repeated until no more input lines remain.
- After all input lines have been read (and processed), the END action is executed (if it exists) and AWK exits.
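As a minimal sketch of next in action, here is a hypothetical program that sums the second field of its input while skipping comment lines:
/^#/ { next }      # skip comment lines; the remaining patterns are not checked
{ sum += $2 }      # never reached for the skipped lines
END { print sum }  # print the total after all input has been read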
Lines, Records and Fields
Actually, RS means record separator, so it's more accurate to say that AWK reads input one record at a time, rather than the usual line by line. However, due to the default behavior of AWK, the words record and line are used interchangeably in AWK jargon.
Let's see the use of RS with input consisting of login names, followed by email addresses and phone numbers. Each user record ends with a line containing a single "@" character:
panos panos1962@gmail.com 0030-2310-999999
@
arnold arnold@yahoo.com 0045-6627-999999
@
mike mike@awk.info 0020-2661-999999
@
maria maria@baboo.com 0035-9932-999999
We want to construct a spreadsheet file of two columns, namely the login name and the phone number. We can achieve this task with AWK, just by setting lines consisting of a single "@" character as the record separator, the tab character as the output field separator, and piping the results to ssconvert:
awk 'BEGIN {
    RS = "\n@\n"    # records separated by "@" lines
    OFS = "\t"      # set tab character as output field separator
} {
    print $1, $3    # print user name and phone number, tab separated
}' user | ssconvert fd://0 user.ods
That was less than 10 lines of AWK code, but the following one-liner would suffice:
awk -v RS='\n@\n' -v OFS='\t' '{ print $1, $3 }' user | ssconvert fd://0 user.ods
Note the use of the -v command line option in order to set the record separator (RS) and the output field separator (OFS) on the command line.
We talked earlier about the RS variable, but what is OFS? We can print output using the print command in AWK. To print many items side by side, we specify the items just after the print command. In this case, the items are printed one after the other without any kind of separation in between. Using the comma separator between the items to be printed causes AWK to print the items one after the other, this time separated by single spaces.
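The difference is easy to see with a quick sketch:
echo 'alpha beta' | awk '{ print $1 $2 }'    # prints "alphabeta"
echo 'alpha beta' | awk '{ print $1, $2 }'   # prints "alpha beta"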
If we want the separator to be something else, then we can do so by setting the OFS variable to the separator wanted:
awk -v OFS='<<>>' '{ print $1, $NF }'
Assuming input consists of lines where each line starts with the user's name and ends with the user's country, the above AWK one-liner will print the first and last fields of every input line separated by "<<>>":
panos<<>>Greece
maria<<>>India
arnold<<>>Israel
mike<<>>England
...
As you may have already guessed, the NF variable is the column count of the current input line (record). Just after reading each input line, AWK counts the fields (columns) of the line and sets NF accordingly.
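For instance, the following one-liner prefixes every input line with its field count:
awk '{ print NF, $0 }'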
We can then refer to each field as $1, $2, … $NF, while fields are separated by AWK by means of the FS variable. FS stands for field separator and by default it is any number of consecutive space and tab characters. If the input is a CSV file, then we must set FS to ",", while if the input fields are tab separated, then we must set FS to "\t".
Regular Expressions
Regular Expressions (RE) are tightly coupled with AWK because AWK talks the language of REs out of the box, e.g. in the following AWK script the pattern sections are just RE matches:
BEGIN { asection = bsection = 0 }
/^[aA][0-9]+/ { asection++ }
/^[bB][0-9]+/ { bsection++ }
END { print asection, bsection }
The first pattern is a regular expression match, as the two slashes (/) denote. Actually, the first pattern means: $0 matches the ^[aA][0-9]+ regular expression. But what exactly does this mean? In the RE dialect, ^ denotes the beginning of the checked item (in this case the checked item is the whole line, so the ^ symbol denotes the beginning of the line), followed by the letters a and A enclosed in brackets, which means one of the two letters a or A, followed by a string of numerical digits (at least one).
This RE matches the following lines:
a1 This is a…
A1.1 In this chapter…
A29 All of the…
a30B12 Any of these…
The matched part of each line is the leading letter and the digits that follow it: a1, A1, A29 and a30 respectively.
The second pattern is similar, but the first letter must be b or B for the line to be matched.
Now let's say that we want the above REs to be matched against the second field instead of the whole line. Here comes ~, the so-called match operator, to the rescue:
BEGIN { asection = bsection = 0 }
$2 ~ /^[aA][0-9]+/ { asection++ }
$2 ~ /^[bB][0-9]+/ { bsection++ }
END { print asection, bsection }
The above script is read as: before reading any input, set the asection and bsection counters to zero. For lines where the second field begins with a or A followed by at least one numerical digit, increase asection by one. For lines where the second field begins with b or B followed by at least one numerical digit, increase bsection by one. After processing all of the input lines, print the asection and bsection counters.
I/O redirection
It's very easy for AWK to redirect input and output. Actually, redirecting I/O in AWK is very similar to shell I/O redirection: AWK and the shell use the same redirection operators, namely ">" for output redirection, "<" for input redirection, ">>" for appending data to files and "|" for piping data to another running program.
Warning!
Because it's so easy for AWK to redirect output to files, you must know
what you are doing or else you are risking your files and data!
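Here is a quick sketch of these operators; the file and command names are just for illustration:
awk '{ print $1 > "names" }' data    # ">"  write the first fields to a file
awk '{ print $1 >> "names" }' data   # ">>" append to the file instead of overwriting it
awk '{ print $1 | "sort -u" }' data  # "|"  pipe the output to another program
awk 'BEGIN { getline line < "names"; print line }'   # "<"  read a line from a file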
Let's say we want our program to filter input lines as follows: all lines with a 3rd column value less than a given number (min) are to be collected in a file, lines with a 3rd column value greater than another given number (max) are to be collected in another file, while all lines with a 3rd column value between min and max are to be printed to standard output. Here is the AWK program for this task:
$3 < min { print > less; next }
$3 > max { print > more; next }
{ print }
Assuming the data file is named data and the above program is stored in the file filter.awk, we can create the files less10 and more20 by running:
awk -v min=10 -v less="less10" -v max=20 -v more="more20" -f filter.awk data
Conclusion
Nowadays there exist plenty of amazing software languages, tools, frameworks, APIs etc. for handling almost every need arising in our fully computerized world, from simple calculations to outer space communications, nanorobotics and machine learning. There is no single piece of software that meets every computing need out there, but AWK is a handy single program found on every computer system, standing there listening to your needs without asking for tons of supporting software, highly paid software engineering experts, or supercomputer-sized machines, in order to carry out anything from the simplest to the most complex tasks, with just a few lines of code.