AWK tutorial

AWK refers either to the program itself, which may be implemented as awk, nawk, mawk, gawk etc., or to the language in which AWK programs are written. So we can say that AWK is a heavily used software tool (referring to the program), or that AWK is an easy-to-learn language (referring to the language). AWK can be run as a standalone program from the command line during an interactive shell session, but it is more commonly used in shell scripts, usually in conjunction with other software tools.

AWK programs may be just one-liners given on the fly as AWK's first command-line argument, but it's not rare for an AWK program to run to hundreds or even thousands of lines; if an AWK program is longer than one line, it's easier (and safer) to save the program in a file and use the -f option on the command line. We can supply multiple scripts by using -f more than once on the command line; in this case the programs are concatenated and run as a whole.
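As a minimal sketch of using -f more than once, the following shell session writes two tiny script files and runs them together. The file names init.awk and greet.awk, the temporary directory and the shell variable names are made up for this illustration:

```shell
# Hypothetical script files, created in a temporary directory for this demo.
dir=$(mktemp -d)
printf '%s\n' 'BEGIN { greeting = "Hello" }' > "$dir/init.awk"
printf '%s\n' '{ print greeting ", " $0 "!" }' > "$dir/greet.awk"

# The two files are concatenated and run as one program, so the variable
# set in init.awk is visible to the action in greet.awk.
multi_out=$(printf 'world\n' | awk -f "$dir/init.awk" -f "$dir/greet.awk")
echo "$multi_out"

rm -r "$dir"
```

With the input line "world" this should print "Hello, world!".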

Beyond static AWK scripts written in specified files, it's a common practice to construct AWK scripts on the fly from other running programs, save them in temporary files, run AWK with the constructed programs and then remove the scripts.

AWK syntax

Every AWK program is just a pattern/action series:

pattern { action }
pattern { action }
...

Every input line is checked against each pattern in turn. If the line matches a pattern, then the corresponding action is taken. To skip the remaining pattern checks for the current line after an action has been taken, one can use the next keyword in that action.
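The effect of next can be seen in a small sketch like the one below; the input numbers and the shell variable name are arbitrary choices for the demonstration. The number 12 is a multiple of both 2 and 3, but it is reported only once, because next stops any further pattern checks for that line:

```shell
next_out=$(printf '12\n9\n7\n' | awk '
($0 % 2) == 0 { print $0, "even"; next }            # stop checking this line
($0 % 3) == 0 { print $0, "multiple of 3"; next }
              { print $0, "neither" }               # reached only if nothing above matched
')
echo "$next_out"
```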

There exist two special "patterns". The BEGIN action is executed before any input has been read, while the END action is executed just before the AWK program exits. The following program prints "Hello!" before any input line is read, then counts lines with an even number of fields, and lines with more than five fields. After all input lines have been read, the program prints the two counters, then prints "Bye..." and exits:

BEGIN		{ print "Hello!"; count2 = 0; count5 = 0 }
(NF % 2) == 0	{ count2++ }
(NF > 5)	{ count5++ }
END		{ print count2, count5; print "Bye..." }

A pattern/action pair may omit either part. If a pattern lacks an action, then every line that matches the pattern is printed (default action); a program consisting of just an always-true pattern, such as 1, therefore prints input to output just like the cat program does. If an action lacks a pattern, then the action is executed for every line that reaches it. There is nothing else to say about the basic syntax of every AWK program.
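The two shorthand forms can be tried out directly from the shell. This is only a sketch with made-up input lines and variable names:

```shell
# A pattern without an action: matching lines are printed as they are.
pattern_only=$(printf 'a b\nc\nd e f\n' | awk 'NF > 1')
echo "$pattern_only"

# An action without a pattern: it runs for every input line.
action_only=$(printf 'a b\nc\nd e f\n' | awk '{ print NF }')
echo "$action_only"
```

The first command prints only the lines with more than one field; the second prints the field count of every line.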

Counting even numbers

We start this tutorial course by writing a program to count all even numbers in the input. The input consists of lines, where each line contains an integer. To count all even numbers in the input stream, we divide each number by 2 and check that the remainder is 0:

(($0 % 2) == 0) { count++ }
END { print count }

Did you know?
As you might have already guessed, $0 refers to the current input line, while NF refers to the number of fields in the current input line. By the way, $1 refers to the first field of the current input line, $2 refers to the second field of the current input line and so on.

Now let's make some changes in order to count not only evens, but also multiples of 3. Our first (and obvious) approach is to use two counters instead of one, add another pattern/action for the remainder of the division by 3, and finally print the two counters instead of one:

(($0 % 2) == 0) { count2++ }
(($0 % 3) == 0) { count3++ }
END { print count2, count3 }

In order to understand how AWK works, we'll trace AWK's logic by hand for some input numbers. Let's say we get the number 4 in the input. AWK checks 4 against the first pattern, that is, whether the remainder of 4 divided by 2 is 0; the pattern matches, therefore the relevant action is taken and count2 is increased. Because this is the first time that variable count2 comes into play, it has no value, so adding 1 to count2 sets that variable to 1, as undefined variables involved in numerical expressions are treated as zero. By the way, undefined variables involved in alphanumeric (string) expressions are treated as empty strings.

After finishing with the first pattern/action, AWK moves to the next pattern/action, which checks the number for divisibility by 3; this clearly fails, as 4 is not a multiple of 3. Therefore no action is taken, and since there are no more pattern/action entities to be processed, AWK reads the next input line, which let's assume is the number 11.

Clearly, 11 is divisible by neither 2 nor 3, so AWK takes no action and reads the next input line, let's say the number 12. Number 12 is divisible by 2, so count2 will be increased. Number 12 is also divisible by 3, so count3 will be increased as well. The same procedure will be followed for every input line. After reading all input lines, AWK will execute the END action, which prints the count2 and count3 counters, and then exits. Mission accomplished!
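The hand trace can be checked by feeding the same three numbers (4, 11 and 12) to the script. The second command is a separate small sketch of how an unset variable behaves in a numerical and in a string expression; the variable name nothing is just an illustration:

```shell
# Same program as above, run on the numbers used in the trace.
trace_out=$(printf '4\n11\n12\n' | awk '
(($0 % 2) == 0) { count2++ }
(($0 % 3) == 0) { count3++ }
END { print count2, count3 }
')
echo "$trace_out"

# An unset variable acts as 0 in arithmetic and as "" in string context.
unset_out=$(awk 'BEGIN { print nothing + 0; print "[" nothing "]" }')
echo "$unset_out"
```

The first command should print "2 1": two multiples of 2 (4 and 12) and one multiple of 3 (12).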

More counting…

Our second approach to counting evens and multiples of three is to make the counting program more flexible, that is, to count multiples not just of 2 and 3 but of any given numbers; we'll call these numbers check numbers. Clearly, there are two tasks involved in that process:

  • Specify check numbers
  • Read input and count multiples

Following the above scheme we may run the program in two phases. Because this is an AWK tutorial, we choose to implement both of these two phases using AWK. First phase consists of constructing another AWK script to be used in the second phase:

    BEGIN {
    	sep = "print"
    }
    
    {
    	print "(($0 % " $0 ") == 0) { count" $0 "++ }"
    	cntcmd = cntcmd sep " count" $0
    	sep = ","
    }
    
    END {
    	print "END { " cntcmd " }"
    }
    

    Assuming that the above script is saved in morecount.awk, run AWK and type the check numbers 2, 3, 5 and 7:

    awk -f morecount.awk
    
    2
    (($0 % 2) == 0) { count2++ }
    3
    (($0 % 3) == 0) { count3++ }
    5
    (($0 % 5) == 0) { count5++ }
    7
    (($0 % 7) == 0) { count7++ }
    [Control-D]
    END { print count2, count3, count5, count7 }

    The output is an AWK script similar to the script we wrote by hand earlier in order to count multiples of 2 and 3. In other words, our AWK program produces another AWK script which will count the multiples of 2, 3, 5 and 7. That's great, but running the above on the fly is not much use because the output is lost. In order to create a script from the above output, we redirect the output to a file:

    awk -f morecount.awk >count2357.awk
    
    2
    3
    5
    7
    [Control-D]

    The count2357.awk file will contain the final AWK script which shall be used to count the multiples of 2, 3, 5 and 7 for any input numbers:

    (($0 % 2) == 0) { count2++ }
    (($0 % 3) == 0) { count3++ }
    (($0 % 5) == 0) { count5++ }
    (($0 % 7) == 0) { count7++ }
    END { print count2, count3, count5, count7 }
    

    We've just used AWK as an AWK program generator! That means, we can now run:

    awk -f count2357.awk input_file
    

    The above command line will count multiples of 2, 3, 5 and 7 for all numbers in input_file and print the results to the standard output. However, we didn't write count2357.awk by hand; we ran another AWK script to produce it for us. Cool!
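The two phases can also be chained in a single shell session, using a temporary file for the generated script and removing it afterwards. The sketch below inlines the generator instead of reading morecount.awk, uses only the check numbers 2 and 3 to keep it short, and relies on mktemp being available; the sample input and variable names are made up:

```shell
# Temporary file for the generated script; removed at the end.
script=$(mktemp)

# Phase 1: generate the counting script from the check numbers 2 and 3.
printf '2\n3\n' | awk '
BEGIN { sep = "print" }
{
	print "(($0 % " $0 ") == 0) { count" $0 "++ }"
	cntcmd = cntcmd sep " count" $0
	sep = ","
}
END { print "END { " cntcmd " }" }
' > "$script"

# Phase 2: run the generated script on some sample numbers.
gen_out=$(printf '4\n6\n8\n9\n' | awk -f "$script")
echo "$gen_out"

rm -f "$script"
```

For the sample numbers 4, 6, 8 and 9 the generated script should report three multiples of 2 and two multiples of 3.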

    Counting better

    We'll implement the multiple counter program using another method. There are two reasons for doing so: first, this is an AWK tutorial and that's the way to do this kind of thing in tutorials. The second reason is to make the program better, faster and more functional; supplying the check numbers to a separate program to produce a script and then running another program using that script may be an interesting approach, but it would be much better to run a single command.

    We choose to specify check numbers in the command line, let's say as a comma separated list of numbers. We also choose to use some kind of looping in the core logic of counting multiples, because to carry out a series of similar checks, one after the other, may be the right thing to do as a human, but for the computer that's a totally wrong way to do it.

    BEGIN {
    	# Parameter check shall be defined in the command line
    	# via -v option, as a comma separated list of numbers.
    	# Split the comma separated list and put the check
    	# numbers in an array.
    
    	n = split(check, a, ",")
    
    	# If no numbers have been specified, then do nothing.
    
    	if (n == 0)
    		exit(0)
    
    	# Create an associative array indexed by the given numbers.
    
    	for (i in a)
    		counter[a[i]] = 0
    }
    
    {
    	# For every input number, iterate all the check numbers and
    	# check the remainders; if the input number is a multiple of
    	# a check number, then increase the corresponding counter.
    
    	for (i in counter) {
    		if (($0 % i) == 0)
    			counter[i]++
    	}
    }
    
    END {
    	# Input has been read, so print the counters for every one
    	# of the check numbers given.
    
    	for (i in counter)
    		print i, counter[i]
    }
    

    To run the above AWK script for 2, 3, 5 and 7:

    awk -v check="2,3,5,7" -f countbetter.awk input_file
    

    Regular Expressions

    Regular Expressions play a major role in AWK. But what are Regular Expressions (REs)? Regular expressions originated in 1951, when Stephen Cole Kleene described regular languages using his mathematical notation called regular sets. Nowadays REs are used mainly for describing patterns to be matched by computer programs such as text editors, interpreters, compilers etc. In short, we can define Regular Expressions as a convenient way to describe text patterns.

    In order to use REs one must know the relevant notation, e.g. the dot symbol (.) means any single character, so the RE Sm.th matches Smith, Smyth and Smath. Characters enclosed in brackets mean any one of the enclosed characters, so the RE [sS]m.th matches Smith and Smyth, but also smith, smyth, sm@th and sm!th. Brackets can be used with character ranges, such as [a-z] to match any lowercase English letter, or [a-zA-Z] to match any English letter (lowercase or capital). In order to match numerical digits, [0-9] can be used. There also exists a negation character (^) to be used as the first character inside brackets, so [^a-zA-Z] can be used to match any character except English letters.

    The asterisk (*) is used to denote repetition, so the RE [0-9]* means a string of numerical digits (even none), while the plus symbol (+) has the same effect except that there must be at least one item to match. The RE [ABC][0-9]*[XYZ] matches A12X, B11Z and CY, while the RE [ABC][0-9]+[XYZ] does not match CY because there must be at least one numerical digit between the [ABC] and [XYZ] patterns. Braces can be used to denote a specific number of repetitions, e.g. [ABC][0-9]{3}[XYZ] means letter A, B or C followed by three numerical digits, followed by letter X, Y or Z; [ABC][0-9]{3,6}[XYZ] means letter A, B or C followed by three, four, five or six numerical digits, followed by letter X, Y or Z; while [ABC][0-9]{3,}[XYZ] means letter A, B or C followed by at least three numerical digits, followed by letter X, Y or Z. In the basic RE syntax used by default in tools like grep and sed, the same braces are written \{ and \}; AWK uses the extended syntax, where they are written without backslashes, although some older AWK implementations do not support them at all.

    Symbols ^ and $ are called anchor characters and denote the start and the end of the checked item respectively. The RE [ABC][0-9]+ matches panosA1962xyz, whereas ^[ABC][0-9]+ matches A1962xyz but not panosA1962xyz. Similarly, [ABC][0-9]*$ matches panosA1962, but not panos1962A1962xyz. Parentheses can be used to group items and clarify ambiguities, while the pipe symbol | means or, so if we want to match names panos or maria followed by four numerical digits, we can use (panos|maria)[0-9]{4}. REs can be very confusing, but when properly used they can save you a lot of lines of code.

    AWK talks the language of REs out of the box, e.g. FS can be set to a RE instead of a single character or literal string, and so can RS in implementations such as gawk and mawk. For example, setting FS to [^a-zA-Z0-9]+ means that any sequence of characters that are neither letters nor digits will be considered as a field separator. Another use of REs in AWK is the notation /RE/ which can be used to match lines, while the tilde symbol (~) is called the match operator and can be used for pattern matching checks on any textual entity. The following AWK script prints lines that contain strings in the form of [ABC][0-9]+[XYZ] and lines where the second field is Smith or Smyth:

    /[ABC][0-9]+[XYZ]/ {
    	print $0
    	next
    }
    
    $2 ~ /^Sm[iy]th$/ {
    	print $0
    	next
    }
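Both uses of REs mentioned above, a RE as field separator and the match operator on a single field, can be tried with one-liners like the following. The input lines and variable names are invented for this sketch, and the -F option is used as the command-line way of setting FS:

```shell
# FS as a RE: any run of characters that are neither letters nor digits.
fs_out=$(printf 'john::smith, 42\n' | awk -F '[^a-zA-Z0-9]+' '{ print NF; print $2 }')
echo "$fs_out"

# Match operator on the second field, anchored at both ends.
match_out=$(printf '1 Smith\n2 Smythe\n3 Smyth\n' | awk '$2 ~ /^Sm[iy]th$/ { print $1 }')
echo "$match_out"
```

The first command splits the line into the three fields john, smith and 42. In the second, Smythe is rejected because the $ anchor requires the field to end right after "th".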
    

    Arrays

    Arrays in AWK are associative lists, that is named lists of key/value pairs. To iterate over the array elements one can use a special kind of a for loop:

    for (variable_name in array_name) {
    	do something interesting here with each
    	one of the array indices and elements
    }
    

    e.g.

    for (i in person) {
    	print i, person[i]
    }
    

    The above AWK code snippet uses variable i to iterate all of the person array elements, printing each index along with the corresponding person array value.

    Of course, arrays can have sequential numeric indices, e.g. 1, 2, 3,… It's clear that such arrays can be iterated using a classical for syntax:

    for (i = 1; i < 100; i++) {
    	print item[i]
    }
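One practical difference between the two loop styles is ordering: the order in which for (i in array) visits the indices is unspecified, while a counting loop visits them in numeric order. A small sketch, assuming an array filled by the split built-in (the array name color and the sample string are made up):

```shell
arr_out=$(awk 'BEGIN {
	# split fills color[1] .. color[n] and returns n
	n = split("red,green,blue", color, ",")

	# Counting loop: elements come out in index order.
	for (i = 1; i <= n; i++)
		print i, color[i]
}')
echo "$arr_out"
```

Using the count returned by split as the loop limit avoids hard-coding the number of elements.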
    

    Functions

    Functions play a major role in every language and AWK is no exception. As in most programming languages, there exist two kinds of functions in AWK: built-in functions and user defined functions.

    Built‑in functions

    Built-in functions include most of the classical functions found in the standard function libraries of most languages. Measuring the length of a string, calculating the square root of a number, file open/close, getting the system time etc. are tasks usually covered by built-in functions. For a complete (and up-to-date) list of AWK's built-in functions, one can always refer to the relevant manual pages.

    Let's see built‑in functions in action. To print statistics concerning the length of the words in a given text file, we can use the length built‑in function:

    {
    	for (i = 1; i <= NF; i++)
    		count[length($i)]++
    }
    
    END {
    	for (len in count)
    		print len, count[len]
    }
    

    The script above is straightforward, but let's trace it by hand given the following input lines:

    The quick brown fox jumps over the lazy dog.
    The five boxing wizards jump quickly.
    

    The first field of the first line is "The" with a length of 3 letters, so count[3] will be increased and because it was unset it will be set to 1. The second field of the first line is "quick" with a length of 5 letters, so count[5] will be set to 1 for the same reasons. The third field is "brown", again with a length of 5 letters, so count[5] will be increased to 2. Next field is "fox" with a length of 3 letters, so count[3] will be increased to 2. Reaching the last field of the first input line, we may notice a problem.

    Actually, the last field will not be set to the word "dog" as one would expect, but rather to the string "dog." with the dot included. The corresponding length will be 4 instead of 3 and our statistics will be wrong; this is absolutely normal, because AWK splits lines using sequences of "white" characters (spaces and tabs) as the default field separator.

    It's easy to change the default field separator in AWK: just set the FS built-in variable to a single character or to a regular expression. The hard part is choosing the appropriate field separator in order to split lines into real life words. We can set the field separator to a sequence of non alphanumeric characters in the BEGIN section:

    BEGIN {
    	FS = "[^[:alnum:]]+"
    }
    ...
    

    Now we are getting better results, but a zero length counter has shown up! Don't panic, it's absolutely normal to have zero length "words" after changing the field separator. Actually, a non alphanumeric character at the end of a line causes AWK to "see" an empty field after the separator and before the end of the line. To make that clear let's check the following line using "," as the field separator:

    panos,arnold,tim
    

    There are clearly three fields in the line above, namely "panos", "arnold" and "tim". Now let's split the following line with the same field separator:

    ,panos,arnold,tim,
    

    Now there exist two extra (empty) fields, one before "panos" and the other after "tim". In our case the last fields of the first line are "lazy" (separated with spaces), "dog" (separated with a space to the left and a dot to the right). But there exists an empty field after the dot too; this is where the zero length word comes into play for the first time. The same is true for the second line which also ends with a dot.

    It was easy to change the field separator in order to split lines into real life words. It's also easy to avoid printing the zero length stats, just by deleting the zero indexed element before printing the array:

    ...
    END {
    	delete count[0]
    
    	for (len in count)
    		print len, count[len]
    }
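Putting the pieces together, the whole word length counter can be run on the two sample lines as sketched below. Two assumptions are made here: the separator is written as [^a-zA-Z0-9]+, which behaves like the [^[:alnum:]]+ class for plain ASCII text and also works in AWK implementations without POSIX character classes, and the output is piped through sort -n because the order of a for (len in count) loop is unspecified:

```shell
wl_out=$(printf '%s\n' 'The quick brown fox jumps over the lazy dog.' \
                       'The five boxing wizards jump quickly.' | awk '
BEGIN { FS = "[^a-zA-Z0-9]+" }
{
	for (i = 1; i <= NF; i++)
		count[length($i)]++
}
END {
	delete count[0]          # drop the zero length "words"
	for (len in count)
		print len, count[len]
}
' | sort -n)
echo "$wl_out"
```

For the two sample lines this should report five words of length 3, four of length 4, three of length 5, one of length 6 and two of length 7.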
    

    For a complete list of built-in functions, refer to the relevant chapter of the AWK reference manual.

    User defined functions

    Built-in functions cover many needs, but most of the time we need to write our own functions for our programs to be clean and modular. Using functions helps the most in avoiding code repetition, decreases program size and reduces the number of global objects. Function definition in AWK is straightforward: just write the keyword function, followed by the function's name, followed by the function's parameters enclosed in parentheses, followed by the function's body enclosed in curly brackets:

    function mean_calc(l, a) {
    	...
    }
    

    In the above AWK code snippet we define function mean_calc which accepts two arguments (l and a). Functions can return a value to the caller, but it's not mandatory to return something (subroutine).

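    {
    	sep = ""
    
    	# Fields are numbered from 1; $0 is the whole line.
    	for (i = 1; i <= NF; i++) {
    		printf "%s%s", sep, mean_calc($i)
    		sep = " "
    	}
    
    	# End the output line for this input line.
    	print ""
    }
    
    function mean_calc(l) {
    	n = split(l, a, ",")
    
    	tot = 0
    
    	for (i in a)
    		tot += a[i]
    
    	return n ? tot / n : "NaN"
    }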
    

    The above AWK script reads lines where each line contains groups of comma separated numerical values. For every input line we print the mean value of each group. However, there is a fatal error in this script: variable i is defined and used as a global, but the same variable is also used inside the user defined function mean_calc, so calling the function clobbers the loop counter. To get things right we could use another variable inside the function, let's say j, but it's clear that this is not safe; we cannot keep track of global variables and change variable names whenever a conflict appears. How, then, can we eliminate global variables in an AWK script?

    Local variables

    The only variables with local scope in AWK are function parameters, which have function scope. In other words, the parameters of an AWK function are local to the function, and even if other variables with the same names exist outside the function, they are completely unrelated.

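    {
    	sep = ""
    
    	# Fields are numbered from 1; $0 is the whole line.
    	for (i = 1; i <= NF; i++) {
    		printf "%s%s", sep, mean_calc($i)
    		sep = " "
    	}
    
    	# End the output line for this input line.
    	print ""
    }
    
    # Parameters after l are never passed; they serve as local variables.
    function mean_calc(l,		n, a, tot, i) {
    	n = split(l, a, ",")
    
    	tot = 0
    
    	for (i in a)
    		tot += a[i]
    
    	return n ? tot / n : "NaN"
    }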
    

    As you can see, we've added some extra parameters to the mean_calc function. Function parameters have local scope, so variables n, a, tot and i become local to the mean_calc function. Of course we don't pass these parameters when calling the function; we just pass the comma separated list of numerical values (this is just a string) as the l parameter and omit the remaining arguments. This is a clever trick to define local variables with function scope in AWK, but there is always a price to pay for that kind of elegance and simplicity: AWK functions cannot accept a variable number of (normal) arguments.
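The scoping rule is easy to verify in isolation. In the sketch below, the function sum_to and its extra parameters i and total are hypothetical names chosen for the demonstration; the global i is set before the call and printed after it:

```shell
scope_out=$(awk '
function sum_to(n,    i, total) {      # i and total are local to sum_to
	for (i = 1; i <= n; i++)
		total += i
	return total
}

BEGIN {
	i = 42                             # global i
	s = sum_to(4)                      # the loop inside uses its own i
	print s, i
}')
echo "$scope_out"
```

This should print "10 42": the function returns 1 + 2 + 3 + 4, and the global i keeps its value even though the function loops over a variable with the same name.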