AWK tutorial
AWK refers either to the program itself, which may be implemented as awk, nawk, mawk, gawk etc., or to the language in which AWK programs are written.
So, we can say that AWK is a heavily used software tool
(referring to the program itself),
or that AWK is an easy to learn language
(referring to the AWK language).
AWK can be used as a standalone program run from the command line during an interactive shell session, but it is more commonly used in shell scripts, usually in conjunction with other software tools.
AWK programs may be just one-liners given on the fly as AWK's first command line argument, but it's not rare for an AWK program to extend to hundreds or even thousands of lines; if an AWK program is more than one line long, then it's easier (and safer) to save the program in a file and use the -f option in the command line.
We can use multiple scripts by giving -f more than once in the command line; in this case the programs are concatenated and run as a whole.
Beyond static AWK scripts written in specified files, it's a common practice to construct AWK scripts on the fly from other running programs, save them in temporary files, run AWK with the constructed programs and then remove the scripts.
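As a small illustration of running several script files as one program, the sketch below writes two tiny scripts to a temporary directory, runs them together with two -f options and then removes them. The file names header.awk and count.awk, the sample input and the use of mktemp are assumptions made for this example only.

```shell
# Work in a temporary directory so nothing is left behind.
tmpdir=$(mktemp -d)

# First script: only a BEGIN rule.
cat >"$tmpdir/header.awk" <<'EOF'
BEGIN { print "line count follows" }
EOF

# Second script: a counting rule plus an END rule.
cat >"$tmpdir/count.awk" <<'EOF'
{ n++ }
END { print n + 0 }
EOF

# Both files are concatenated and run as a single program.
multi_f_output=$(printf 'a\nb\nc\n' |
    awk -f "$tmpdir/header.awk" -f "$tmpdir/count.awk")
echo "$multi_f_output"

# Remove the temporary scripts.
rm -rf "$tmpdir"
```

For the three input lines above, the combined program prints the header line followed by 3.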
AWK syntax
Every AWK program is just a pattern/action series:
    pattern { action }
    pattern { action }
    ...
Every input line is checked against each pattern in turn.
If the line matches a pattern, then the corresponding action is taken.
To skip further pattern checks after a taken action, one can use the next keyword in the corresponding action.
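The effect of next can be seen in a minimal sketch such as the following, where each input number is reported by exactly one rule. The shell function name classify and the sample numbers are made up for this example.

```shell
# classify: label each input number; next stops further pattern checks.
classify() {
    awk '
        $1 < 0  { print $1, "negative"; next }
        $1 == 0 { print $1, "zero"; next }
                { print $1, "positive" }   # reached only if no rule above fired
    '
}

printf '%s\n' -3 0 7 | classify
```

Without the two next statements, the last rule (which has no pattern) would also print "positive" for -3 and 0.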
There exist two special "patterns".
The BEGIN pattern action is executed before any input has been read, while the END pattern action is executed just before the AWK program exits.
The following program prints "Hello!" before any input line is read, then counts lines with an even number of fields, and lines with more than five fields. After all input lines have been read, the program prints the two counters, then prints "Bye..." and exits:
    BEGIN { print "Hello!"; count2 = 0; count5 = 0 }
    (NF % 2) == 0 { count2++ }
    (NF > 5) { count5++ }
    END { print count2, count5; print "Bye..." }
An AWK program may have no actions at all; in that case AWK just prints the input lines that match its patterns, so a program consisting of an always-true pattern such as 1 copies input to output just like the cat program does.
If a pattern lacks an action, then every line that matches the pattern is printed (default action).
If an action lacks a pattern, then this action is executed for every input line that reaches it.
There is nothing else to say about the basic syntax of every AWK program.
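The two defaults described above can be tried with a short sketch like this one; the sample input and the shell variable names are assumptions made for the example.

```shell
# A pattern without an action: matching lines are printed as they are.
long_lines=$(printf 'a b\na b c d\nx\n' | awk 'NF > 2')
echo "$long_lines"

# An action without a pattern: the action runs for every input line.
first_fields=$(printf 'a b\na b c d\nx\n' | awk '{ print $1 }')
echo "$first_fields"
```

The first command prints only the line with more than two fields, while the second prints the first field of all three lines.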
Counting even numbers
We start this tutorial by writing a program to count all even numbers in the input. Input consists of lines, where each line contains an integer. To count all even numbers in the input stream we divide each number by 2 and check whether the remainder is 0:
    (($0 % 2) == 0) { count++ }
    END { print count }
Did you know?
As you might have already guessed, $0 refers to the current input line, while NF refers to the number of fields in the current input line. By the way, $1 refers to the first field of the current input line, $2 refers to the second field of the current input line and so on.
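A quick way to see these variables at work is a one-liner such as the following sketch; the input line and the shell variable name are invented for the example.

```shell
# Print the field count, the first two fields and the whole line.
field_demo=$(printf 'alpha beta gamma\n' |
    awk '{ print NF; print $1; print $2; print $0 }')
echo "$field_demo"
```

For this input the four print statements output 3, alpha, beta and the complete line, one per line.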
Now let's make some changes in order to count not only evens, but also multiples of 3. Our first (and obvious) approach is to use two counters instead of one, then add another pattern/action concerning the remainder of division by 3, and finally print the two counters instead of one:
    (($0 % 2) == 0) { count2++ }
    (($0 % 3) == 0) { count3++ }
    END { print count2, count3 }
In order to understand how AWK works, we'll trace AWK's logic by hand for some input numbers. Let's say we get number 4 in the input. AWK checks number 4 against the first given pattern, that is whether the remainder of 4 divided by 2 is 0; this pattern is matched by number 4, therefore the relevant action is taken, that is count2 will be increased; because it's the first time that variable count2 comes into play, it has no value, so adding 1 to count2 sets that variable to 1, as undefined variables involved in numerical expressions are considered zero valued. By the way, undefined variables involved in alphanumeric (string) expressions are considered empty strings.
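The treatment of undefined variables can be checked with a tiny sketch like the one below; the variable names n, s and never_set are arbitrary and exist only in this example.

```shell
# n is used as a number (starts at zero), s as a string (starts empty),
# never_set is measured with length() without ever being assigned.
unset_demo=$(awk 'BEGIN { n++; s = s "x"; print n, "[" s "]", length(never_set) }')
echo "$unset_demo"
```

This prints 1 [x] 0: the increment starts from zero, the concatenation starts from an empty string, and the untouched variable has length zero.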
After finishing with the first pattern/action, AWK moves to the next pattern/action, which checks whether the number is divisible by 3; this clearly fails, as 4 is not a multiple of 3. Therefore no action is taken, and since there are no more pattern/action entities to be processed, AWK reads the next input line, which let's assume is the number 11. Clearly, 11 is divisible neither by 2 nor by 3, so AWK takes no action and reads the next input line, let's say the number 12. Number 12 is divisible by 2, so count2 will be increased. Number 12 is also divisible by 3, so count3 will be increased as well. The same procedure will be followed for every input line. After reading all input lines, AWK will execute the END action, which is to print the count2 and count3 counters, and then exit.
Mission accomplished!
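The hand trace can be replayed with the same three numbers. In the sketch below the shell function name count_2_and_3 is made up, and adding 0 to each counter in the END action is an assumption of this example, so that a counter that never matched prints as 0 rather than as an empty string.

```shell
# count_2_and_3: count even numbers and multiples of 3 read from input.
count_2_and_3() {
    awk '
        ($0 % 2) == 0 { count2++ }
        ($0 % 3) == 0 { count3++ }
        END { print count2 + 0, count3 + 0 }
    '
}

printf '%s\n' 4 11 12 | count_2_and_3
```

For the input 4, 11 and 12 the result is 2 1: both 4 and 12 are even, and only 12 is a multiple of 3.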
More counting…
Our second approach in counting evens and multiples of three is to make the number counting AWK program more flexible, that is not just to count multiples of 2 and 3, but to count multiples of any given numbers; we'll call these numbers check numbers. Clearly, there are two tasks involved in that process: specifying the check numbers, and counting the multiples of each check number in the input.
Following the above scheme we may run the program in two phases. Because this is an AWK tutorial, we choose to implement both of these two phases using AWK. First phase consists of constructing another AWK script to be used in the second phase:
    BEGIN { sep = "print" }

    {
        print "(($0 % " $0 ") == 0) { count" $0 "++ }"
        cntcmd = cntcmd sep " count" $0
        sep = ","
    }

    END { print "END { " cntcmd " }" }
Assuming that the above script is saved in morecount.awk, run AWK and type the check numbers 2, 3, 5 and 7:
    awk -f morecount.awk
    2
    (($0 % 2) == 0) { count2++ }
    3
    (($0 % 3) == 0) { count3++ }
    5
    (($0 % 5) == 0) { count5++ }
    7
    (($0 % 7) == 0) { count7++ }
    [Control-D]
    END { print count2, count3, count5, count7 }
The output is an AWK script similar to the script we wrote by hand earlier in order to count 2 and 3 multiples. In other words, our AWK program produces another AWK script which will count the multiples of 2, 3, 5 and 7. That's great, but running the above on the fly is of not much use because the output is lost. In order to create a script from the above output, we shall redirect output to a file:
    awk -f morecount.awk >count2357.awk
    2
    3
    5
    7
    [Control-D]
The count2357.awk file will contain the final AWK script which will be used to count the multiples of 2, 3, 5 and 7 for any input numbers:
    (($0 % 2) == 0) { count2++ }
    (($0 % 3) == 0) { count3++ }
    (($0 % 5) == 0) { count5++ }
    (($0 % 7) == 0) { count7++ }
    END { print count2, count3, count5, count7 }
We've just used AWK as an AWK program generator! That means, we can now run:
awk -f count2357.awk input_file
The above command line will count multiples of 2, 3, 5 and 7 for all numbers in input_file and print the results to the standard output. However, we didn't write count2357.awk by hand; we ran another AWK script to produce the count2357.awk script for us.
Cool!
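The two phases can also be run without typing anything interactively. The following sketch feeds the check numbers through a pipe, stores the generated script in a temporary directory and runs it on a few sample numbers; the temporary directory, the sample numbers and the shell variable name are assumptions of this example, while the generator is the morecount.awk script from above.

```shell
tmpdir=$(mktemp -d)

# Phase 0: save the generator script (quoted here-document, so the
# shell leaves $0 alone).
cat >"$tmpdir/morecount.awk" <<'EOF'
BEGIN { sep = "print" }
{
    print "(($0 % " $0 ") == 0) { count" $0 "++ }"
    cntcmd = cntcmd sep " count" $0
    sep = ","
}
END { print "END { " cntcmd " }" }
EOF

# Phase 1: generate the counting script for check numbers 2, 3, 5 and 7.
printf '%s\n' 2 3 5 7 | awk -f "$tmpdir/morecount.awk" >"$tmpdir/count2357.awk"

# Phase 2: run the generated script on some sample numbers.
generated_counts=$(printf '%s\n' 4 9 10 14 15 | awk -f "$tmpdir/count2357.awk")
echo "$generated_counts"

rm -rf "$tmpdir"
```

With the sample numbers 4, 9, 10, 14 and 15 the generated script prints 3 2 2 1, that is three multiples of 2, two of 3, two of 5 and one of 7.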
Counting better
We'll implement the multiple counter program using another method. There are two reasons for doing so: first, this is an AWK tutorial and that's how such things are done in tutorials. The second reason is to make the program better, faster and more functional; supplying the check numbers to a separate program to produce a script and then running another program using that script may be an interesting approach, but it would be much better to run a single command.
We choose to specify check numbers in the command line, let's say as a comma separated list of numbers. We also choose to use some kind of looping in the core logic of counting multiples, because to carry out a series of similar checks, one after the other, may be the right thing to do as a human, but for the computer that's a totally wrong way to do it.
    BEGIN {
        # Parameter check shall be defined in the command line
        # via -v option, as a comma separated list of numbers.

        # Split the comma separated list and put the check
        # numbers in an array.
        n = split(check, a, ",")

        # If no numbers have been specified, then do nothing.
        if (n == 0)
            exit(0)

        # Create an associative array indexed by the given numbers.
        for (i in a)
            counter[a[i]] = 0
    }

    {
        # For every input number, iterate all the check numbers and
        # check the remainders; if the input number is a multiple of
        # a check number, then increase the corresponding counter.
        for (i in counter) {
            if (($0 % i) == 0)
                counter[i]++
        }
    }

    END {
        # Input has been read, so print the counters for every one
        # of the check numbers given.
        for (i in counter)
            print i, counter[i]
    }
To run the above AWK script for 2, 3, 5 and 7:
awk -v check="2,3,5,7" -f countbetter.awk input_file
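The command line above relies on two mechanisms: the -v option, which sets an AWK variable before the program starts, and the split function, which breaks a string into array elements. A minimal sketch of just these two pieces might look like this (the shell variable name is invented for the example):

```shell
# split() returns the number of pieces and fills a[1], a[2], ...
split_demo=$(awk -v check="2,3,5,7" 'BEGIN { n = split(check, a, ","); print n, a[1], a[n] }')
echo "$split_demo"
```

Here split returns 4, and the first and last elements are 2 and 7, so the output is 4 2 7.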
Regular Expressions
Regular expressions play a major role in AWK. But what are regular expressions (REs)? Regular expressions originated in 1951, when Stephen Cole Kleene described regular languages using his mathematical notation called regular sets. Nowadays REs are used mainly for describing patterns to be matched by computer programs such as text editors, interpreters, compilers etc. In short, we can define regular expressions as a convenient way to describe text patterns.
In order to use REs one must know the relevant notation, e.g. the dot symbol (.) means any single character, so the RE Sm.th matches Smith, Smyth and Smath. Characters enclosed in brackets mean any of the enclosed characters, so RE [sS]m.th matches Smith and Smyth, but also matches smith, smyth, sm@th and sm!th.
Brackets can be used with character ranges, such as [a-z] to match any lowercase English letter, or [a-zA-Z] to match any English letter (lowercase or capital). In order to match numerical digits, [0-9] can be used. There also exists a negation character (^) to be used in brackets, so [^a-zA-Z] can be used to match any character except English letters.
The asterisk (*) is used to denote repetition, so RE [0-9]* means a string of numerical digits (even none), while the plus symbol (+) has the same effect except that there must be at least one item to match. RE [ABC][0-9]*[XYZ] matches A12X, B11Z and CY, while RE [ABC][0-9]+[XYZ] does not match CY because there must be at least one numerical digit between the [ABC] and [XYZ] patterns.
Braces can be used to denote a specific number of repetitions. In the extended RE syntax that AWK uses they are written as plain { and } (basic REs, as used by grep and sed, write them as \{ and \}), although some older AWK implementations do not support this notation. For example, [ABC][0-9]{3}[XYZ] means letter A, B or C followed by three numerical digits, followed by letter X, Y or Z, [ABC][0-9]{3,6}[XYZ] means letter A, B or C followed by three, four, five or six numerical digits, followed by letter X, Y or Z, while [ABC][0-9]{3,}[XYZ] means letter A, B or C followed by at least three numerical digits, followed by letter X, Y or Z.
Symbols ^ and $ are called anchor characters and denote the start and the end of the checked item respectively. The RE [ABC][0-9]+ matches panosA1962xyz, but ^[ABC][0-9]+ matches A1962xyz but not panosA1962xyz. Similarly, [ABC][0-9]*$ matches panosA1962, but not panos1962A1962xyz.
Parentheses can be used for grouping, while the pipe symbol | means or, so if we want to match names panos or maria followed by four numerical digits, we can use (panos|maria)[0-9]{4} (in the extended RE syntax used by AWK).
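Several of the claims above can be verified from the shell with a small helper. In the sketch below, matches is a shell function invented for this example; it passes the RE and the string to AWK as variables and uses the ~ operator to test for a match. Repetition counts in braces are left out on purpose, since not every AWK implementation supports them.

```shell
# matches REGEX STRING: print yes when STRING matches REGEX, otherwise no.
matches() {
    awk -v re="$1" -v s="$2" 'BEGIN { print ((s ~ re) ? "yes" : "no") }'
}

matches '[ABC][0-9]*[XYZ]' 'CY'              # * allows zero digits
matches '[ABC][0-9]+[XYZ]' 'CY'              # + needs at least one digit
matches '[ABC][0-9]+' 'panosA1962xyz'        # unanchored, may match anywhere
matches '^[ABC][0-9]+' 'panosA1962xyz'       # anchored at the start
matches '^(panos|maria)[0-9]+$' 'maria1962'  # grouping and alternation
```

The five calls print yes, no, yes, no and yes respectively.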
REs can be very confusing, but when properly used they can save you a lot of lines of code.
AWK talks the language of REs out of the box, e.g. FS (and, in implementations such as gawk and mawk, also RS) can be set to REs instead of single characters or literal strings. For example, setting FS to [^a-zA-Z0-9]+ means that any sequence of characters that are neither letters nor digits will be considered as a field separator.
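As a brief illustration of a regular expression used as field separator, the following sketch splits a line whose words are separated by a mix of punctuation and spaces; the input line and the shell variable name are made up for the example.

```shell
# Any run of characters other than letters and digits separates fields.
fs_demo=$(printf 'one, two;;three\n' |
    awk 'BEGIN { FS = "[^a-zA-Z0-9]+" } { print NF; print $3 }')
echo "$fs_demo"
```

The comma followed by a space and the two semicolons each count as one separator, so the line has 3 fields and the third one is three.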
Another use of REs in AWK is the notation /RE/ which can be used to match lines, while the tilde symbol (~) is called the match operator and can be used for pattern matching checks on any textual entity. The following AWK script prints lines that contain strings of the form [ABC][0-9]+[XYZ] and lines where the second field is Smith or Smyth:
    /[ABC][0-9]+[XYZ]/ {
        print $0
        next
    }

    $2 ~ /^Sm[iy]th$/ {
        print $0
        next
    }
Arrays
Arrays in AWK are associative lists, that is named lists of key/value pairs. To iterate over the array elements one can use a special kind of for loop:
    for (variable_name in array_name) {
        do something interesting here with each one of the array indices and elements
    }
e.g.
    for (i in person) {
        print i, person[i]
    }
The above AWK code snippet uses variable i to iterate over all of the person array elements, printing each index along with the corresponding person array value. Of course, arrays can have sequential numeric indices, e.g. 1, 2, 3,… It's clear that such arrays can be iterated using the classical for syntax:
    for (i = 1; i < 100; i++) {
        print item[i]
    }
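A complete run of the person example could look like the sketch below. The input data (names and years) and the shell variable name are invented for the illustration, and the output is piped through sort because AWK does not guarantee any particular order when iterating with for (i in array).

```shell
# Fill the array while reading input, then print it in the END action.
person_list=$(printf 'panos 1962\nmaria 1970\n' |
    awk '{ person[$1] = $2 } END { for (i in person) print i, person[i] }' |
    sort)
echo "$person_list"
```

After sorting, the output is maria 1970 followed by panos 1962.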
Functions
Functions play a major role in every language and AWK is no exception. As in most programming languages, there exist two kinds of functions in AWK: built-in functions and user defined functions.
Built‑in functions
Built-in functions include most of the classical functions found in the standard function libraries of most languages. Measuring the length of a string, calculating the square root of a number, opening and closing files, getting the system time etc. are tasks usually covered by built-in functions. For a complete (and updated) list of AWK's built-in functions, one can always refer to the relevant manual pages.
Let's see built-in functions in action. To print statistics concerning the length of the words in a given text file, we can use the length built-in function:
    {
        for (i = 1; i <= NF; i++)
            count[length($i)]++
    }

    END {
        for (len in count)
            print len, count[len]
    }
The script above is straightforward, but let's trace it by hand given the following input lines:
    The quick brown fox jumps over the lazy dog.
    The five boxing wizards jump quickly.
The first field of the first line is "The" with a length of 3 letters, so count[3] will be increased and because it was unset it will be set to 1. The second field of the first line is "quick" with a length of 5 letters, so count[5] will be set to 1 for the same reasons. The third field is "brown", again with a length of 5 letters, so count[5] will be increased to 2. Next field is "fox" with a length of 3 letters, so count[3] will be increased to 2. Reaching the last field of the first input line, we may notice a problem.
Reaching the last field of the first input line, we may notice a problem.
Actually, the last field will not be set to the word "dog" as one would expect, but rather to the string "dog." with the dot included. The corresponding length will be 4 instead of 3 and our statistics will be wrong; this is absolutely normal, because AWK splits lines using sequences of "white" characters (spaces and tabs) as the default field separator.
It's easy to change AWK's default field separator: just set the FS built-in variable to a single character or to a regular expression. The hard part is choosing the appropriate field separator in order to split lines into real life words. We can set the field separator to a sequence of non alphanumeric characters in the BEGIN section:
    BEGIN { FS = "[^[:alnum:]]+" }
    ...
Now we are getting better results, but a zero length counter has shown up! Don't panic, it's absolutely normal to have zero length "words" after changing the field separator. Actually, a non alphanumeric character at the end of a line causes AWK to "see" an empty field after the separator and before the end of the line. To make that clear let's check the following line using "," as the field separator:
panos,arnold,tim
There are clearly three fields in the line above, namely "panos", "arnold" and "tim". Now let's split the following line with the same field separator:
,panos,arnold,tim,
Now there exist two extra (empty) fields, one before "panos" and the other after "tim". In our case the last fields of the first line are "lazy" (separated with spaces), "dog" (separated with a space to the left and a dot to the right). But there exists an empty field after the dot too; this is where the zero length word comes into play for the first time. The same is true for the second line which also ends with a dot.
It was easy to change the field separator in order to split lines in real life words. It's also easy to avoid printing the zero length stats just by deleting the zero indexed element before printing the array:
    ...
    END {
        delete count[0]
        for (len in count)
            print len, count[len]
    }
For a complete list of built-in functions refer to the relevant chapter of the AWK reference manual.
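Putting the pieces of this section together gives something like the following sketch. The shell function name word_length_stats is made up, the field separator is spelled [^a-zA-Z0-9]+ instead of using the [:alnum:] class (an assumption made here so the example also runs on AWK implementations without character classes), and the output is sorted numerically because the order of for (len in count) is not defined.

```shell
# word_length_stats: print "length count" pairs for the words in the input.
word_length_stats() {
    awk '
        BEGIN { FS = "[^a-zA-Z0-9]+" }
        {
            for (i = 1; i <= NF; i++)
                count[length($i)]++
        }
        END {
            delete count[0]     # drop the empty fields at line ends
            for (len in count)
                print len, count[len]
        }
    ' | sort -n
}

printf '%s\n' 'The quick brown fox jumps over the lazy dog.' \
              'The five boxing wizards jump quickly.' | word_length_stats
```

For the two sample lines this reports five words of 3 letters, four of 4, three of 5, one of 6 and two of 7.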
User defined functions
Built-in functions cover many needs, but most of the time we need to write our own functions for our programs to be clean and modular. The use of functions helps most in avoiding code repetition, decreases program size and reduces the number of global objects.
Function definition in AWK is straightforward: just write the keyword function, followed by the function's name, followed by the function's parameters enclosed in parentheses, followed by the function's body enclosed in curly brackets:
    function mean_calc(l, a) { ... }
In the above AWK code snippet we define function mean_calc which accepts two arguments (l and a). Functions can return a value to the caller, but it's not mandatory to return something (subroutine).
    {
        sep = ""
        # Fields are numbered from 1 ($0 is the whole line).
        for (i = 1; i <= NF; i++) {
            printf "%s%s", sep, mean_calc($i)
            sep = " "
        }
        print ""    # end the output line
    }

    function mean_calc(l) {
        n = split(l, a, ",")
        tot = 0
        for (i in a)
            tot += a[i]
        return n ? tot / n : "NaN"
    }
The above AWK script will read lines where each line contains groups of comma separated numerical values. For every input line we print the mean value of each group. Of course there is a fatal error in this script, as variable i is defined and used as global, but later on we use the same variable in the user defined function mean_calc. In order to get things right we must use another variable inside the function, let's say j, but it's clear that this is not safe; we cannot keep track of global variables and change variable names whenever a conflict appears. Then, how can we eliminate global variables in an AWK script?
Local variables
The only variables with local scope in AWK are function parameters, which have function scope. In other words, the parameters passed to an AWK function are local to the function, and even if there exist other variables with the same name outside the function, these variables are not related to each other in any way.
    {
        sep = ""
        # Fields are numbered from 1 ($0 is the whole line).
        for (i = 1; i <= NF; i++) {
            printf "%s%s", sep, mean_calc($i)
            sep = " "
        }
        print ""    # end the output line
    }

    function mean_calc(l, n, a, tot, i) {
        n = split(l, a, ",")
        tot = 0
        for (i in a)
            tot += a[i]
        return n ? tot / n : "NaN"
    }
As you can see, we've added some function parameters to the mean_calc function. Function parameters have local scope, so variables n, a, tot and i become local to the mean_calc function. Of course we don't pass these parameters when calling the function; we just pass the comma separated list of numerical values (this is just a string) as the l parameter and omit the remaining arguments. This is a clever trick to define local variables with function scope in AWK, but there is always a price to pay for that kind of elegance and simplicity: AWK functions cannot accept a variable number of (normal) arguments.
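The difference between a global variable and a parameter used as a local variable can be seen in isolation with a sketch like the one below. The function names clobber and safe are invented for this example; the extra spaces before the local parameter are a common convention to set locals apart from real arguments, not something AWK requires.

```shell
scope_demo=$(awk '
    function clobber()     { i = 99 }   # assigns the global i
    function safe(    i)   { i = 99 }   # assigns a local i only

    BEGIN {
        i = 1
        safe();    print i    # the global i is untouched
        clobber(); print i    # the global i has been overwritten
    }
')
echo "$scope_demo"
```

The first print outputs 1 because safe only changed its own local i, while the second outputs 99 because clobber assigned the global variable.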