PPL — Producing random data
Introduction
Based on the AWK rand/srand functions, the PPL library can be used to produce
random data of some basic data types: strings, numbers and dates.
We can use PPL functions to produce millions of data rows in order to populate
databases for testing purposes (the name PPL does, indeed, come from the word populate).
The following AWK script prints between 100 and 200 random strings of lowercase
English letters, each between 10 and 20 characters long:
@include "ppl.awk"

BEGIN {
	srand()

	for (i = ppl_integer(100, 200); i > 0; i--)
		print ppl_string(10, 20, ppl_lower)
}
PPL functions may be used not only for populating test databases, but also for testing any kind of program that manipulates ASCII data. Let's say we've just developed a C program that implements a new sort algorithm (wow!) and we need to run some tests to validate the correctness and check the efficiency of our newly created program. One of our tests could be to sort a million-line ASCII data file with two tab-separated columns: an 8-digit integer in the first column and a string of 40 to 60 characters in the second:
@include "ppl.awk"

BEGIN {
	OFS = "\t"
	srand()

	for (count += 0; count > 0; count--)
		print ppl_integer(10000000, 99999999), ppl_string(40, 60, ppl_lower)
}
As you can see, the count variable is not set in the script, so it can be set
on the command line. Given that our script is stored in the testsort.awk file,
run the following command to create the file test1M.data:
awk -v count=1000000 -f testsort.awk >test1M.data
In less than 10 seconds, AWK will write 1000000 lines of random data
to the file test1M.data:
37331048	xvugbztwqwglwsoiyisjqrylnsamyojobfzkqboiupece
98140409	qyhpnclgbqizghmvwlsnrqeoyetxpovghswsgdwwmezjsqrgpizkjxdmsv
90470326	jvqpusincosbmeteeqpxlapjpvgifjtkisywtbixrtgeldf
12774313	juyasljhemhepduutqelxhuwhcbpaqrlkkqoshptysashqrzxih
14682720	jcboufzmueiixewzyxsgjryfcaumrhdzenugidtogfprstdoupgmgou
80821224	orcobtgrwvphzsjgwqbumaymdgtrfhzodyhwubzqzejdwgooymcgfvejf
68140173	aioyvowxskvknzzlqqjogpiarjqtqvclkldwrsba
53573553	robnecihajytbmlgeowicdedfxijkozzzbrxspbjbydassvqlywpica
90038513	vajlrjahobbdlyigmsnawlziushbtnxjtqpzjozc
80500364	qwldbvcovkltapgbtntvuyifbeliywikilsahelcaunwvaafoieniezfdnrx
...
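Before feeding such a file to the program under test, it may be worth confirming that every line has the expected shape. What follows is only a minimal sketch and not part of PPL: the helper name check_shape is hypothetical, and the two sample rows are copied from the output shown above.

```sh
# check_shape: hypothetical helper, not part of PPL.
# Prints the number of lines that are NOT of the form
# "8-digit integer <TAB> 40 to 60 lowercase letters".
check_shape() {
	awk -F '\t' '
	NF != 2 ||
	length($1) != 8 || $1 !~ /^[0-9]+$/ ||
	length($2) < 40 || length($2) > 60 || $2 !~ /^[a-z]+$/ { bad++ }
	END { print bad + 0 }
	' "$@"
}

# Check two sample rows; the printed number is the count of malformed lines.
printf '%s\t%s\n' \
	37331048 xvugbztwqwglwsoiyisjqrylnsamyojobfzkqboiupece \
	98140409 qyhpnclgbqizghmvwlsnrqeoyetxpovghswsgdwwmezjsqrgpizkjxdmsv |
check_shape
```

To check the whole generated file, the same helper can be called as check_shape test1M.data.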
The PPL library
The PPL library is just an AWK script that contains a BEGIN section and some
function definitions; you can view the code online, but it's better to download
the library and view the code locally.
It's good practice to place the library in one of the AWKPATH
directories, so you can include it in your AWK scripts as follows:
@include "ppl.awk"
Alternatively, you can include the PPL library on the command line using
the -f option:
awk -f ppl.awk -f your_script data_files…
However, including the library in the application AWK script is more compact and gives the reader a clear sign that random data will be used in the script.
The PPL API
The PPL API consists of a small number of AWK functions that can be called from AWK scripts. In fact, there are fewer than 10 functions in the API.
ppl_string(min, max, palette)
ppl_string returns a random-length string of characters randomly chosen
from the palette. The min and max parameters govern the length of the string:
the returned string will be of random length between these two numbers
(inclusive). palette is an array of characters (1-based). The length of the
array must be stored in palette[0], e.g. a palette of three letters, namely
"A", "B" and "C", would be an array like this:
ABC[0] = 3
ABC[1] = "A"
ABC[2] = "B"
ABC[3] = "C"
Because it's cumbersome to form a palette one item at a time, the PPL function
ppl_palette may be used to produce palettes from arbitrary strings; the above
palette could have been produced by calling:
ppl_palette("ABC", ABC)
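The palette layout and the way a string could be drawn from it can be sketched in plain AWK. This is not the PPL source: demo_palette and demo_string are hypothetical stand-ins that only follow the description above, and the shell wrapper demo_strings exists just to make the example runnable on its own.

```sh
demo_strings() {
	awk '
	# demo_palette: hypothetical stand-in for ppl_palette. Fills pal[1..n] with
	# the characters of s, stores n in pal[0] and returns n.
	function demo_palette(s, pal,    i, n) {
		split("", pal)		# portable way to empty the array
		n = length(s)
		for (i = 1; i <= n; i++)
			pal[i] = substr(s, i, 1)
		pal[0] = n
		return n
	}

	# demo_string: hypothetical stand-in for ppl_string. Picks a length in
	# [min, max], then draws that many characters from pal at random.
	function demo_string(min, max, pal,    len, s) {
		len = min + int(rand() * (max - min + 1))
		s = ""
		while (len-- > 0)
			s = s pal[1 + int(rand() * pal[0])]
		return s
	}

	BEGIN {
		srand()
		demo_palette("ABC", ABC)
		for (i = 0; i < 5; i++)
			print demo_string(3, 8, ABC)
	}'
}

# Prints five strings of 3 to 8 characters made of A, B and C.
demo_strings
```

Storing the length in element 0 lets the drawing function find the palette size without counting the array elements on every call.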
ppl_integer(min, max)
ppl_integer returns a random integer between min and max (inclusive).
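One common way to obtain such an inclusive range from rand is shown below. It is a sketch under the assumption that the library does something similar, which the text does not confirm; rand_int and the wrapper demo_integer are hypothetical names.

```sh
# demo_integer MIN MAX N: print N random integers in [MIN, MAX].
demo_integer() {
	awk -v min="$1" -v max="$2" -v n="$3" '
	# rand_int: hypothetical helper, not the PPL source. rand() lies in [0, 1),
	# so int(rand() * (hi - lo + 1)) is one of 0, 1, ..., hi - lo.
	function rand_int(lo, hi) {
		return lo + int(rand() * (hi - lo + 1))
	}

	BEGIN {
		srand()
		for (i = 0; i < n; i++)
			print rand_int(min, max)
	}'
}

demo_integer 100 200 5
```

The "+ 1" is what makes max reachable; without it the largest value returned would be max - 1.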
ppl_float(min, max)
ppl_float returns a random floating point number between min and max.
Returned values may reach min but never reach max, just as rand returns
values in the [0, 1) range.
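The half-open range follows naturally if rand is simply scaled and shifted. The sketch below assumes that approach; rand_float and the wrapper demo_float are hypothetical names, not part of PPL.

```sh
# demo_float MIN MAX N: print N random numbers in [MIN, MAX).
demo_float() {
	awk -v min="$1" -v max="$2" -v n="$3" '
	# rand_float: hypothetical helper. rand() is in [0, 1), so the result
	# starts at lo and stays below hi.
	function rand_float(lo, hi) {
		return lo + rand() * (hi - lo)
	}

	BEGIN {
		srand()
		# print formats non-integer numbers with OFMT ("%.6g" by default),
		# so the text shown is a rounded form of the value.
		for (i = 0; i < n; i++)
			print rand_float(min, max)
	}'
}

demo_float 1.5 2.5 5
```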
ppl_timestamp(min, max)
ppl_timestamp returns a random timestamp between min and max.
This function behaves much like ppl_integer, but with two particularities:
if max is missing, it is set to the current system time;
if min is negative, then min is calculated by subtracting the given number
from max.
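The argument handling can be sketched as follows. Two assumptions are made here that the text does not settle: a negative min of -N is read as "N seconds before max", and the current time is passed in from the shell with date +%s (widely available, though not required by POSIX) so that the AWK code itself stays portable. rand_timestamp and the wrapper demo_timestamp are hypothetical names.

```sh
# demo_timestamp MIN [MAX]: print one random timestamp (seconds since the epoch).
demo_timestamp() {
	awk -v now="$(date +%s)" -v min="$1" -v max="$2" '
	function rand_int(lo, hi) {
		return lo + int(rand() * (hi - lo + 1))
	}

	# rand_timestamp: hypothetical helper, not the PPL source.
	function rand_timestamp(lo, hi) {
		if (hi == "")		# max omitted: fall back to the current time
			hi = now
		if (lo < 0)		# assumption: -N means N seconds before max
			lo = hi + lo
		return rand_int(lo, hi)
	}

	BEGIN {
		srand()
		print rand_timestamp(min, max)
	}'
}

# A random moment within the last 24 hours.
demo_timestamp -86400
```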
ppl_login()
ppl_login returns a random string to simulate simple login names,
e.g. panos, maria, arnold etc. Of course, the names produced by ppl_login
are rarely pronounceable, but who cares; programs do not pronounce login names,
do they?
ppl_domain()
ppl_domain returns a random string that simulates a domain name,
such as sdfheuejdj.asd, fhepll.xx or esdfdfgfg.syo.
As you may know, the part after the dot is called the TLD (top-level domain);
TLDs produced by ppl_domain are 2 or 3 characters long, while the part before
the dot is a random string of lowercase English letters.
ppl_email()
ppl_email returns a random email address, produced as a random login name,
followed by a "@" character, followed by a random domain name.
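How the three pieces fit together can be sketched in a few lines of AWK. The lengths chosen for the login name and for the part before the dot are assumptions made for this example; only the 2 or 3 character TLD comes from the description of ppl_domain. All function names below are hypothetical and none of this is the PPL source.

```sh
# demo_email [N]: print N (default 3) random email-like strings.
demo_email() {
	awk -v n="${1:-3}" '
	function rand_int(lo, hi) {
		return lo + int(rand() * (hi - lo + 1))
	}

	# rand_lower: a string of len random lowercase English letters.
	function rand_lower(len,    s) {
		s = ""
		while (len-- > 0)
			s = s substr("abcdefghijklmnopqrstuvwxyz", rand_int(1, 26), 1)
		return s
	}

	# Assumed lengths: 4 to 10 letters for the login, 4 to 12 before the dot.
	function demo_login() {
		return rand_lower(rand_int(4, 10))
	}

	function demo_domain() {
		return rand_lower(rand_int(4, 12)) "." rand_lower(rand_int(2, 3))
	}

	BEGIN {
		srand()
		for (i = 0; i < n; i++)
			print demo_login() "@" demo_domain()
	}'
}

demo_email 3
```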
ppl_name()
ppl_name returns a random name, that is, a string of two space-separated parts
that simulate a first and a last name. Each part begins with a random uppercase
English letter, followed by a number of lowercase English letters.
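A name of this shape could be produced as in the sketch below. The number of lowercase letters per part (2 to 9 here) is an assumption, since the text does not state it; rand_part and the wrapper demo_name are hypothetical names.

```sh
# demo_name [N]: print N (default 3) random "First Last" style names.
demo_name() {
	awk -v n="${1:-3}" '
	function rand_int(lo, hi) {
		return lo + int(rand() * (hi - lo + 1))
	}

	# rand_part: one uppercase letter followed by 2 to 9 lowercase letters
	# (the 2 to 9 range is an assumption made for this sketch).
	function rand_part(    len, s) {
		s = substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", rand_int(1, 26), 1)
		for (len = rand_int(2, 9); len > 0; len--)
			s = s substr("abcdefghijklmnopqrstuvwxyz", rand_int(1, 26), 1)
		return s
	}

	BEGIN {
		srand()
		for (i = 0; i < n; i++)
			print rand_part() " " rand_part()
	}'
}

demo_name 3
```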
ppl_palette(string, palette)
ppl_palette is used for palette construction.
The first parameter is an arbitrary string, while the second is
the palette to be constructed.
The palette is constructed as a 1-based array of all the characters of the
passed string, with the palette length stored at position 0 of the array.
ppl_palette returns the length of the palette.