PPL — Producing random data

Introduction

Built on the AWK rand and srand functions, the PPL library can produce random data of some basic data types: strings, numbers and dates. We can use PPL functions to produce millions of data rows in order to populate databases for testing purposes (the name PPL, indeed, comes from the word populate). The following AWK script prints a random number of strings (between 100 and 200); each string consists of lowercase English letters and has a random length between 10 and 20 characters:

@include "ppl.awk"

BEGIN {
	srand()
	for (i = ppl_integer(100, 200); i > 0; i--)
		print ppl_string(10, 20, ppl_lower)
}

PPL functions may be used not only for populating test databases, but also for testing any kind of program that manipulates ASCII data. Let's say we've just developed a C program that implements a new sort algorithm (wow!) and we need to run some tests to validate its correctness and check its efficiency. One of our tests could be to sort a million-line ASCII data file of two tab-separated columns, namely an 8-digit integer in the first column and a string of 40 to 60 characters in the second column:

@include "ppl.awk"

BEGIN {
	OFS = "\t"

	srand()
	for (count += 0; count > 0; count--)
		print ppl_integer(10000000, 99999999), ppl_string(40, 60, ppl_lower)
}

As you can see, the script never assigns the count variable, so it can be set on the command line. Assuming our program is stored in the testsort.awk file, run the following command to create the file test1M.data:

awk -v count=1000000 -f testsort.awk >test1M.data

In less than 10 seconds AWK will write 1000000 lines of random data to the file test1M.data:

37331048	xvugbztwqwglwsoiyisjqrylnsamyojobfzkqboiupece
98140409	qyhpnclgbqizghmvwlsnrqeoyetxpovghswsgdwwmezjsqrgpizkjxdmsv
90470326	jvqpusincosbmeteeqpxlapjpvgifjtkisywtbixrtgeldf
12774313	juyasljhemhepduutqelxhuwhcbpaqrlkkqoshptysashqrzxih
14682720	jcboufzmueiixewzyxsgjryfcaumrhdzenugidtogfprstdoupgmgou
80821224	orcobtgrwvphzsjgwqbumaymdgtrfhzodyhwubzqzejdwgooymcgfvejf
68140173	aioyvowxskvknzzlqqjogpiarjqtqvclkldwrsba
53573553	robnecihajytbmlgeowicdedfxijkozzzbrxspbjbydassvqlywpica
90038513	vajlrjahobbdlyigmsnawlziushbtnxjtqpzjozc
80500364	qwldbvcovkltapgbtntvuyifbeliywikilsahelcaunwvaafoieniezfdnrx
...

The PPL library

The PPL library is just an AWK script that contains a BEGIN section and some function definitions; you can view the code online, but it's better to download the library and view the code locally. It's good practice to place the library in one of the AWKPATH directories, so you can include it in your AWK scripts as follows:

@include "ppl.awk"

Alternatively, you can include the PPL library on the command line using the -f option:

awk -f ppl.awk -f your_script data_files…

However, including the library in the application AWK script is more compact and gives the reader a clear sign that random data will be used in the script.

The PPL API

The PPL API consists of a small number of AWK functions that can be called from AWK scripts. There are fewer than 10 functions in the API.


ppl_string(min, max, palette)

ppl_string returns a random-length string of characters randomly chosen from the palette. The min and max parameters govern the length of the string: the returned string will be of random length between these two numbers (inclusive). palette is a 1-based array of characters. The length of the array must be stored in palette[0], e.g. a palette of three letters, namely "A", "B" and "C", would be an array like this:

ABC[0] = 3
ABC[1] = "A"
ABC[2] = "B"
ABC[3] = "C"

Because it's cumbersome to form a palette one item at a time, the PPL function ppl_palette may be used to produce palettes from arbitrary strings; the above palette could have been produced by calling:

ppl_palette("ABC", ABC)
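To make the palette layout concrete, here is a minimal sketch of how a string could be drawn from such an array. It is not the PPL source: the function name sketch_string and its internals are assumptions made for illustration only, based on the description above.

```awk
# Hypothetical sketch, not the PPL source: draw a random-length string
# from a palette laid out as described above (length in palette[0]).
function sketch_string(min, max, palette,    len, s, i) {
	# Pick a length in [min, max], both ends included.
	len = min + int(rand() * (max - min + 1))
	s = ""
	for (i = 0; i < len; i++) {
		# Palette characters live at indices 1..palette[0].
		s = s palette[1 + int(rand() * palette[0])]
	}
	return s
}

BEGIN {
	srand()

	# The three-letter palette from the text, built by hand.
	ABC[0] = 3
	ABC[1] = "A"
	ABC[2] = "B"
	ABC[3] = "C"

	print sketch_string(5, 8, ABC)
}
```

Keeping the length in element 0 means the function never has to count the array elements, which is presumably why the palette format requires it.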


ppl_integer(min, max)

The ppl_integer function returns a random integer between min and max (inclusive).


ppl_float(min, max)

The ppl_float function returns a random floating point number between min and max. Returned values may reach min but they never reach max, just as rand returns values in the [0, 1) range.
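The difference between the inclusive range of ppl_integer and the half-open range of ppl_float is easy to see when both are written in terms of rand. The following is only a sketch under that assumption; sketch_integer and sketch_float are hypothetical names, and the real library code may differ.

```awk
# Hypothetical helpers, not the PPL source.

# Integer in [min, max]: the "+ 1" is what makes max reachable.
function sketch_integer(min, max) {
	return min + int(rand() * (max - min + 1))
}

# Float in [min, max): rand() is always below 1, so the result stays
# below max (barring floating point rounding on extreme ranges).
function sketch_float(min, max) {
	return min + rand() * (max - min)
}

BEGIN {
	srand()
	print sketch_integer(1, 6)	# a die roll: 1, 2, 3, 4, 5 or 6
	print sketch_float(0, 100)	# a value from 0 up to, but not including, 100
}
```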


ppl_timestamp(min, max)

The ppl_timestamp function returns a random timestamp between min and max. It behaves much like ppl_integer, with two particularities: if max is missing, it is set to the current system time; if min is negative, min is calculated by subtracting the given number from max.
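The two particularities can be sketched as follows. This is not the PPL source: sketch_timestamp is a hypothetical name, and we assume that a negative min means "that many seconds before max". The sketch also relies on systime and strftime, which are gawk extensions.

```awk
# Hypothetical sketch, not the PPL source. Needs gawk for systime()/strftime().
function sketch_timestamp(min, max) {
	# A missing argument is uninitialized, which compares equal to "".
	if (max == "")
		max = systime()

	# Assumption: a negative min is an offset counted back from max.
	if (min < 0)
		min = max + min

	# Same inclusive draw as for plain integers.
	return min + int(rand() * (max - min + 1))
}

BEGIN {
	srand()

	# A random moment within the last 24 hours (86400 seconds).
	ts = sketch_timestamp(-86400)
	print ts, strftime("%Y-%m-%d %H:%M:%S", ts)
}
```

Under this reading, a call with a single negative argument is a convenient way to say "some time in the recent past" without computing either bound by hand.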


ppl_login()

The ppl_login function returns a random string that simulates a simple login name, e.g. panos, maria, arnold etc. Of course the names produced by ppl_login are rarely pronounceable, but who cares; programs do not pronounce login names, do they?


ppl_domain()

The ppl_domain function returns a random string that simulates a domain name, such as sdfheuejdj.asd, fhepll.xx, esdfdfgfg.syo etc. As you may know, the part after the dot is called the TLD (top level domain); TLDs produced by ppl_domain are 2 or 3 characters long, while the part before the dot is a random string of lowercase English letters.


ppl_email()

The ppl_email function returns a random email address formed by a random login name, followed by a "@" character, followed by a random domain name.
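The composition of login name, "@" and domain name can be sketched in a few lines. Again this is only an illustration, not the PPL source: the sketch_ functions are hypothetical, and all length ranges except the 2 or 3 character TLD are assumptions.

```awk
# Hypothetical sketch, not the PPL source; length ranges are assumptions.

# Random string of lowercase English letters, min to max characters long.
function sketch_lower(min, max,    len, s) {
	len = min + int(rand() * (max - min + 1))
	s = ""
	while (len-- > 0)
		s = s substr("abcdefghijklmnopqrstuvwxyz", 1 + int(rand() * 26), 1)
	return s
}

# Name part, a dot, then a TLD of 2 or 3 characters.
function sketch_domain() {
	return sketch_lower(4, 12) "." sketch_lower(2, 3)
}

# Login name, "@", domain name.
function sketch_email() {
	return sketch_lower(4, 10) "@" sketch_domain()
}

BEGIN {
	srand()
	for (i = 0; i < 3; i++)
		print sketch_email()
}
```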


ppl_name()

The ppl_name function returns a random name, that is, a string of two space-separated parts simulating the first and the last name. Each part begins with a random uppercase English letter, followed by a number of lowercase English letters.


ppl_palette(string, palette)

The ppl_palette function is used for palette construction. The first parameter is an arbitrary string, while the second is the palette to be constructed. The palette is constructed as a 1-based array of all the characters of the passed string, and the palette length is stored at position 0 of the array. ppl_palette returns the length of the palette.
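As an illustration of the array that results, here is a minimal sketch of palette construction. It is not the PPL source; sketch_palette is a hypothetical name and the real function may be written differently.

```awk
# Hypothetical sketch, not the PPL source.
function sketch_palette(string, palette,    n, i) {
	split("", palette)	# portable way to empty the array first
	n = length(string)
	for (i = 1; i <= n; i++)
		palette[i] = substr(string, i, 1)
	palette[0] = n		# the length goes to position 0
	return n
}

BEGIN {
	# A palette of hexadecimal digits.
	n = sketch_palette("0123456789abcdef", HEX)
	print n, HEX[1], HEX[16]	# prints: 16 0 f
}
```

A character that appears more than once in the string would simply occupy more than one slot, so in a sketch like this repeated characters would be drawn proportionally more often.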