SPAWK — SQL Powered AWK
Introduction
Based on two-way pipes, known as coprocesses in gawk
's jargon,
SPAWK forms a tiny library of AWK functions that make possible for AWK to communicate
with MySQL/MariaDB databases.
Using SPAWK, one can submit SQL queries to MySQL/MariaDB database servers from within
AWK scripts and process the result rows in the same AWK script.
DDL/DML SQL queries may be submitted as well.
Let's take a look at what a SPAWK script may look like:
@include "spawk.awk" BEGIN { spawk_sesami("chat", "xxx", "chat") } { count_relations($1) } function count_relations(user, count) { spawk_query("SELECT `relationship` FROM `relation` WHERE `user` LIKE '" user "'") while (spawk_fetchrow(row, 1)) count[row[0]]++ print user, count["FRIEND"] + 0, count["BLOCKED"] + 0 }
The above SPAWK script includes the SPAWK library at first.
In the BEGIN
section, the database credentials are being set.
For every input line, the first field is considered to be a user login name
for which the program will select the related users and count the selected
relations based on the relation kind.
Then, the user login name along with the friends and blocked
counts for this user will be printed.
Notice!
This is an introductory, light version of SPAWK.
While you can download and run this version of SPAWK in your local system,
it's strongly recommended not to do so.
Just read this page as you may read any article in your favourite Linux magazine.
If you find SPAWK interesting, visit the official SPAWK site at
spawk.opasopa.net
and download the real SPAWK following the instructions there.
A similar SPAWK script can be used to print friends and blocked counts for ALL registered users:
@include "spawk.awk" BEGIN { spawk_sesami("chat", "xxx", "chat") spawk_query("SELECT `login` FROM `user`") while (spawk_fetchrow(user, 1)) count_relations(user[0]) } function count_relations(user, count) { spawk_query("SELECT `relationship` FROM `relation` WHERE `user` = '" user "'") while (spawk_fetchrow(row, 1)) count[row[0]]++ print user, count["FRIEND"] + 0, count["BLOCKED"] + 0 }
The chat application
Assume a chat application based on a relational database holding user accounts and relations between users. Every user may create relations of type "FRIEND" or "BLOCKED" with other registered users of the application. A simple database schema for the chat database would be:
-- Application "chat" is about web chatting between application users. -- Behind "chat" application, lies the "chat" database were user data and -- user relationships are stored. -- (Re)create the "chat" database. DROP DATABASE IF EXISTS `chat` ; CREATE DATABASE `chat` DEFAULT CHARSET = utf8 DEFAULT COLLATE = utf8_general_ci ; -- Select "chat" database as the default database. USE `chat` ; -- Table "user" is the most significant table of the "chat" database. -- Each row represents a registered user of the "chat" application. CREATE TABLE `user` ( `login` VARCHAR(64) NOT NULL COMMENT 'Login name', `registration` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Registration date', `name` VARCHAR(128) NOT NULL COMMENT 'Full name', `email` VARCHAR(128) NULL DEFAULT NULL COMMENT 'e-mail', -- user passwords are stored in SHA1 `password` CHARACTER(40) COLLATE utf8_bin NOT NULL COMMENT 'Password', PRIMARY KEY ( `login` ) USING BTREE, INDEX ( `name` ) USING BTREE, UNIQUE INDEX ( `email` ) USING BTREE ) ENGINE = InnoDB COMMENT = 'Person table' ; -- Table "relation" holds the relationships between users. -- There are three kinds of relationships between any two users: -- The users are unrelated, friends or blocked. CREATE TABLE `relation` ( `user` VARCHAR(64) NOT NULL COMMENT 'Login name', `related` VARCHAR(64) NOT NULL COMMENT 'Related login name', `relationship` ENUM ( 'FRIEND', 'BLOCKED' ) NOT NULL COMMENT 'Kind of relationship', `when` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Date of relation', PRIMARY KEY ( `user`, `related` ) USING BTREE, INDEX ( `related` ) USING BTREE ) ENGINE = InnoDB COMMENT = 'Relations table' ; ALTER TABLE `relation` ADD FOREIGN KEY ( `user` ) REFERENCES `user` ( `login` ) ON UPDATE CASCADE ON DELETE CASCADE ; ALTER TABLE `relation` ADD FOREIGN KEY ( `related` ) REFERENCES `user` ( `login` ) ON UPDATE CASCADE ON DELETE CASCADE ; -- Create user for generic DQL/DML access to "chat" database -- Warning!!! -- Option "IF EXISTS" produces an error prior to MySQL version 5.7 -- In that case, drop the user manually and delete DROP command -- from this file. Then run the (modified) SQL script again to -- create the "chat" database and the user "chat". DROP USER IF EXISTS 'chat'@'localhost' ; CREATE USER 'chat'@'localhost' IDENTIFIED BY 'xxx' ; GRANT SELECT, INSERT, UPDATE, DELETE ON `chat`.* TO 'chat'@'localhost' ;
Creating the database
Assuming that the above database schema is stored in schema.sql
file,
run the following command as the root user of the database to (re)create the database:
Warning!!!
The following command will drop any already existing chat database
in your system; all relevant data will be erased!
If there exists a database user named chat, this user will also
be dropped!
mysql -u root -p <schema.sql
Populating the database
After creating the chat database and the chat generic database user granted with DQL/DML database access, we want to populate the database with test data in order to develop programs and run various test suits for the chat application. We'll use SPAWK and PPL AWK libraries for populating the database with random data:
gawk -v ucount=1000 -v maxrel=10 -f populate.awk
After running for a couple of seconds, your chat database will be populated with 1000 user rows, and about 50000 relation rows. There exists a possibility of unique constraints violations on user insert operations; in such a case your database may contain less than 1000 user rows. We can avoid duplications writing some extra SPAWK code, but it's not worthwhile to do so.
It's time to take a look at the populate.awk
SPAWK script:
@include "spawk.awk" @include "ppl.awk" BEGIN { srand(PROCINFO["pid"]) spawk_sesami("chat", "xxx", "chat") # "ucount" is the number of users to be inserted. # "ucount" may be defined in the command line via -v option ucount += 0 # After inserting users in the database, every user # will obtain a random number of relations with other # users. The number of relations to be inserted for # each user is random but less than "maxrel" which # can be defined in the command line via -v oprion. maxrel += 0 relation_populate(user_populate()) } function user_populate( i, user) { nuser = 0 while (ucount-- > 0) { user["login"] = ppl_login() user["registration"] = ppl_timestamp() user["name"] = ppl_name() user["email"] = ppl_email() user["password"] = ppl_string(6, 20, ppl_lower) if (spawk_submit("INSERT INTO `user` (`login`, `registration`, " \ "`name`, `email`, `password`) VALUES ('" user["login"] "', " \ "FROM_UNIXTIME(" user["registration"] "), '" user["name"] "', " \ "'" user["email"] "', SHA1('" user["password"] "')) " \ "ON DUPLICATE KEY UPDATE `name` = VALUES(`name`)") != 1) continue # duplicates return 2, not 1 ulist[++nuser] = user["login"] } return nuser } function relation_populate(nuser, i, nrel, query) { for (i = 1; i <= nuser; i++) { nrel = ppl_integer(1, maxrel) query = "INSERT INTO `relation` (`user`, `related`, `relationship`) VALUES " while (nrel-- > 1) query = query relation_row(ulist[i]) "," query = query relation_row(ulist[i]) \ " ON DUPLICATE KEY UPDATE `relationship` = VALUES(`relationship`)" spawk_submit(query) } } function relation_row(user, related) { while((related = ulist[ppl_integer(1, nuser)]) == user) ; return "('" user "', '" ulist[ppl_integer(1, nuser)] "', '" \ (ppl_integer(0, 1) ? "FRIEND" : "BLOCKED") "')" }
Printing relations
After populating the chat database, it's time to make productive use of SPAWK. Our first program will print the counts of relations declared for specified users.
@include "spawk.awk" BEGIN { spawk_sesami("chat", "xxx", "chat") } { delete count spawk_query("SELECT `relationship` FROM `relation` WHERE `related` LIKE '" $0 "'") while (spawk_fetchrow(relation, 1)) count[relation[0]]++ print $0, count["FRIEND"] + 0, count["BLOCKED"] + 0 }
Run the above script and type in some user login names from your keyboard. For each name you'll get the results, that is the user login name, the count of users declared the specified user as "FRIEND", and the count of users declared the specified user as "BLOCKED":
gawk -f relpr.awkasdasdasdasd 9 12sdfhwwsdfhww 3 8gfuuzsygfuuzsy 19 5...sdfuiruisdfuirui 2 6
Maybe the login names seem biazarre, but don't forget that this is a randomly populated database!
Anyway, we'll improve our SPAWK program adding an option to print all users
instead of selected users given as input.
To accomplish that, we can use an AWK variable which will be used as a flag:
if this variable has a non zero value, then all users will be printed,
else (default) the users login names will be read as input line by line.
Choosing all
as a name for this flag, our script becomes:
@include "spawk.awk" BEGIN { spawk_sesami("chat", "xxx", "chat") if (all) exit(print_all_users()) } { spawk_query("SELECT `login` FROM `user` WHERE `login` LIKE '" $0 "'") while (spawk_fetchrow(user, 1)) print_user(user[0]) } function print_user(user, relation, count) { spawk_query("SELECT `relationship` FROM `relation` WHERE `related` = '" user "'") while (spawk_fetchrow(relation, 1)) count[relation[0]]++ print user, count["FRIEND"] + 0, count["BLOCKED"] + 0 } function print_all_users( user) { spawk_query("SELECT `login` FROM `user`") while (spawk_fetchrow(user, 1)) print_user(user[0]) return 0 }
To run the above program for users supplied as input, just type:
gawk -f relprint.awk
But if you want to run the above program for all users in the database,
then you must supply a non-zero value to the variable all
and this can be accomplished via AWK -v
option in the
command line:
gawk -v all=1 -f relprint.awk
Writing application programs
It's time to enclose all of the above SPAWK stuff about printing relations
in a handy shell script.
In order to do so, we must take care of AWKPATH
environment
variable for spawk.awk
to be accessible from any directory.
Our application SPAWK scripts must also be located in a well known directory.
We can also specify absolute pathnames for AWK scripts, but it's
strongly recommended not to do so.
Let's assume that we are using /usr/local/share/awklib
directory for storing system AWK libraries, e.g. spawk.awk
for accessing databases, ppl.awk
for producing random
data etc.
In this case the AWKPATH
variable may be set as:
export AWKPATH=".:/usr/local/share/awklib"
Now, let's assume that our chat application uses
/usr/local/apps/chat
directory as a base directory.
A possible directory structure under this directory would
contain subdirectories as bin
, database
,
lib
etc. It's good to create an awklib
subdirectory under /usr/local/apps/chat
directory
in order to store various AWK scripts concerning the chat
application.
At last, assume that the base directory of the chat
application is stored in the CHAT_BASEDIR
environment variable:
export CHAT_BASEDIR="/usr/local/apps/chat"
Assuming the above chat enviroment, we'll develop the
relprint
program in /usr/local/apps/chat/bin
directory.
There is no need to write any complicated C program,
neither to write spaghetti SQL scripts.
All we need is to move the relprint.awk
SPAWK script to
/usr/local/apps/chat/awk
directory,
and then write a trivial shell script to call AWK properly.
Most of the job will be executed by SPAWK, leaving only trivial tasks
for the shell to do, like parsing command line arguments etc:
#!/use/bin/bash progname=`basename $0` errs= [ -z "${CHAT_BASEDIR}" ] && { echo "${progname}: CHAT_BASEDIR: variable not set" >&2 errs="yes" } [ -z "${AWKPATH}" ] && { echo "${progname}: AWKPATH: variable not set" >&2 errs="yes" } [ -n "${errs}" ] && exit 1 usage() { echo "usage: ${progname} { -a | [ files... ] }" >&2 exit 1 } all=0 while getopts ":a" opt do case "${opt}" in a) all=1 ;; \?) echo "${progname}: ${OPTARG}: invalid option" >&2 errs="yes" ;; esac done [ -n "${errs}" ] && usage shift `expr ${OPTIND} - 1` [ ${all} -eq 1 -a $# -ne 0 ] && usage exec gawk -v all=${all} -f "${CHAT_BASEDIR}/awklib/relprint.awk" $*
Program refinement
Now that our goal has been accomplished, we can make some program refinements.
To be more specific, we'll add an option for specifying the key column to be other
than the user login name, that is email or full name.
We'll also add some flexibility on what to print.
We choose not to change the relprint
program but rather to develop
the relsel
program and the corresponding SPAWK script:
@include "spawk.awk" BEGIN { OFS = "\t" # Flag "all" controls the overall program's behavior on # how to specify users for counting their popularity. # If "all" has a non-zero value (default) then the program # will read login names, full names or emails from standard input. # In that case, we can specify the key column by setting "ucol" # variable (default 1). all += 0 # Check the key (login/email/name) column. If not specified, # then the first column is assumed as the key column. ucol += 0 if (ucol <= 0) ucol = 1 # We have the option to specify users by keys other than the # user login name. To do so we must specify the key using the # "key" variable, e.g. awk -v key="email" will use user email # instead of user login name. valid_keys["login"] valid_keys["email"] valid_keys["name"] if (!key) key = "login" else if (!(key in valid_keys)) fatal(key ": invalid key") # By default the login name and the key will be printed. # However, we can specify which columns to print by setting # the "projection" variable to a list of any user column names. # The list may be separated by any special character, e.g. comma, # colon, space etc. if (!projection) projection = (key == "login" ? "login" : "login," key) nprjcols = split(projection, prjcols, "[^a-zA-Z0-9_]+") # The user select SQL clause is fixed, so we store this in # the "uselect" variable, as it will be used more than once # in the script uselect = "SELECT `login`, `name`, `email` FROM `user`" spawk_sesami("chat", "xxx", "chat") if (all) exit(process_all()) } NF < ucol { next } { spawk_query(uselect " WHERE `" key "` LIKE '" $ucol "'") while (spawk_fetchrow(user)) process_user(user) } function process_user(user, relation, count, i) { spawk_query("SELECT `relationship` FROM `relation` " \ "WHERE `related` = '" user[1] "'") while (spawk_fetchrow(relation, 1)) count[relation[0]]++ user["login"] = user[1] user["name"] = user[2] user["email"] = user[3] for (i = 1; i <= nprjcols; i++) printf user[prjcols[i]] OFS print count["FRIEND"] + 0, count["BLOCKED"] + 0 } function process_all( user) { spawk_query(uselect) while (spawk_fetchrow(user)) process_user(user) exit(0) } function fatal(msg) { print msg >"/dev/stderr" exit(2) }
The relsel
shell script follows. It's very similar to
relprint
, but there are some new options and the
SPAWK script passed to gawk
is relsel.awk
instead of
relprint.awk
:
#!/use/bin/bash progname=`basename $0` errs= [ -z "${CHAT_BASEDIR}" ] && { echo "${progname}: CHAT_BASEDIR: variable not set" >&2 errs="yes" } [ -z "${AWKPATH}" ] && { echo "${progname}: AWKPATH: variable not set" >&2 errs="yes" } [ -n "${errs}" ] && exit 1 usage() { echo "usage: ${progname} [ -k key ] [ -p projection ] { -a | [ files... ] }" >&2 exit 1 } all=0 key= projection= while getopts ":ak:p:" opt do case "${opt}" in a) all=1 ;; k) key="${OPTARG}" ;; p) projection="${OPTARG}" ;; \?) echo "${progname}: ${OPTARG}: invalid option" >&2 errs="yes" ;; esac done [ -n "${errs}" ] && usage shift `expr ${OPTIND} - 1` [ ${all} -eq 1 -a $# -ne 0 ] && usage exec gawk -v all=${all} -v key="${key}" -v projection="${projection}" \ -f "${CHAT_BASEDIR}/awklib/relsel.awk" $*
Relation statistics
We'll close this introductory SPAWK course by developing one more SPAWK script to produce some interesting statistics based on our chat user relations. To be more specific, we want to calculate the number of concordant as well as the number of contradictive user relations in the database. A relation is considered to be concordant if the reverse relation exists and is of the same kind; conversely, a relation is considered as contradictive if the reverse relation exists and is of a different kind. This seems not to be a trivial task to accomplish, but SPAWK comes to our rescue:
@include "spawk.awk" BEGIN { spawk_sesami("chat", "xxx", "chat") concordant = 0 contradictive = 0 spawk_query("SELECT `user`, `related`, `relationship` FROM `relation`") while (spawk_fetchrow(relation)) process_relation(relation) # concordant and contradictive relations have been counted twice, # once for the "user" and once for the "related". print concordant / 2, contradictive / 2 } function process_relation(relation, reverse) { # skip self relations if (relation[1] == relation[2]) return # select the reverse relation (if exists) spawk_query("SELECT `relationship` FROM `relation` " \ "WHERE `user` = '" relation[2] "' AND `related` = '" relation[1] "'") # if the reverse relation does not exist, then do nothing if (!spawk_fetchone(reverse, 1)) return # check for the same or different kind of the reverse relation if (reverse[0] == relation[3]) concordant++ else contradictive++ }
Of course there exist dark corners in our program,
most of them having to do with concurrency issues.
For example, when relation rows are being inserted, deleted or modified
while gawk
process is still running,
you may get weird (non integer) results in heavily used databases,
as some of the concordant or contradictive relationships may not be
count twice as they should.
Such undesirable situations are almost impossible to avoid without imposing
table locks, but doing so may lead to much more complicated situations.
As it has been said by Brian Kernighan, one of the creators of AWK:
Dark corners are basically fractal–no matter how much you illuminate,
there’s always a smaller but darker one.
Fixing anomalies
SPAWK is capable of executing DML and DDL queries too.
Based on relstats.awk
, the relfix.awk
script
will accept the introdel
flag in order to delete
self relationships, that is relations where user and related
is the same person.
Relation statistics have also been extended to include self relationships.
@include "spawk.awk" BEGIN { spawk_sesami("chat", "xxx", "chat") concordant = 0 contradictive = 0 # Non-zero "introdel" flag value causes self relationships to be delete. # SPAWK returns the number of affected rows on DML queries. introverted = introdel ? spawk_submit("DELETE FROM `relation` WHERE `user` = `related`") : 0 spawk_query("SELECT `user`, `related`, `relationship` FROM `relation`") while (spawk_fetchrow(relation)) process_relation(relation) # concordant and contradictive relations have been counted twice, # once for the "user" and once for the "related". print concordant / 2, contradictive / 2, introverted } function process_relation(relation, reverse) { # skip self relations if (relation[1] == relation[2]) return introverted++ # select the reverse relation (if exists) spawk_query("SELECT `relationship` FROM `relation` " \ "WHERE `user` = '" relation[2] "' AND `related` = '" relation[1] "'") # if the reverse relation does not exist, then do nothing if (!spawk_fetchone(reverse, 1)) return # check for the same or different kind of the reverse relation if (reverse[0] == relation[3]) concordant++ else contradictive++ }
To print relation statistics using relfix.awk
:
gawk -f relfix.awk146 130 22
To print relation statistics deleting any self relationships:
gawk -v introdel=1 -f relfix.awk146 130 22
Print relation statistics once again:
gawk -f relfix.awk146 130 0
The SPAWK library
The SPAWK library is just an AWK script that contains a BEGIN
section and function definitions; you can
view
the code online, but it's better to
download
the libray and view the code locally.
It's a good practice to locate the library in one of the AWKPATH
directories, so you can include it in your SPAWK scripts as follows:
@include "spawk.awk"
Alternatively you can include the SPAWK library in the command line using
the -f
option:
awk -f spawk.awk -f your_script data_files…
However, including the library in the application AWK script is more compact and gives the reader a clear sign that database usage is involved in the sciprt.
The SPAWK API
SPAWK API consists of a small subset of SPAWK library functions that can be called from AWK scripts. Actually there are less than 10 functions in the API, each belonging to one of these three categories: authentication, query submition, fetching results, miscellaneous functions.
spawk_sesami(user, password [, database])
spawk_sesami
is the one and only authentication function in the API.
The parameters are straightforward, namely the database user name,
the user's password and the database name (optional).
These parameters are used whenever a new client is to be created.
Usually spawk_sesami
function is called in the BEGIN
section of an AWK script.
spawk_query(query)
spawk_query
submits a DQL query to the database server for execution.
DQL queries are SQL queries that produce result rows; usually a DQL
query starts with a SELECT
statement.
After query submission via spawk_query
, all of the result rows must be
retrieved from the client to release the client and get it ready to accept another query.
Submitting another query while there exist result rows waiting to be read,
causes a new client to be pushed in the SPAWK stack.
spawk_fetchrow(row [, nosplit])
spawk_fetchrow
asks for the next result row to be retrieved
from the client that processed the last query passed to spawk_query
.
If there are no more result rows to be retrieved, the function returns 0,
else the result row is returned in row[0], while row[1],
row[2],… row[n] are filled in with the corresponding column
values and, finally, the number of columns is returned.
If passed a (non zero) nosplit value, spawk_fetchrow
does not split the row in columns and a value of 1 is returned on success.
spawk_fetchrest()
spawk_fetchrest
is used to skip all remainig result rows.
spawk_fetchone(row [, nosplit])
spawk_fetchone
function is just like spawk_fetchrow
,
but skips all remaining result rows.
spawk_fetchone
returns 0 if there's no result row to be returned.
spawk_submit(query)
spawk_submit
submits a DML/DDL query to the database server
for execution.
Normally, no result rows willl be produced from such queries.
spawk_submit
returns the number of the rows affected,
or 0 for DDL queries and DML queries that did not affect any rows.
On failure -1 is returned.
spawk_error(msg)
spawk_error
is a miscellaneous function that prints an
error to the standard error.
spawk_fatal(msg, err)
spawk_fatal
is just like spawk_error
,
but exits AWK program with exit status err
.