Generating Test Data Using Faker

Faker is a Python library that generates fake data, which makes it handy for unit testing and stress testing your app. Whether you need a large volume of random records or a small set of structured test data, Faker can produce it.

Install the library with pip:

pip install faker

For this demo, I am going to generate a large CSV file of invoices. Then, I’ll loop through them to get some totals. In a real project, this might involve loading a huge amount of data into a database and then querying it. For this example, we will keep the size and scope a little more manageable.

I have created a file, faker_test.py. For starters, we need to import the library and create a new Faker object. I have also set up a constant for the record count, which can easily be changed to adjust the size of our stress test.

faker_test.py

from faker import Faker

RECORD_COUNT = 100000
fake = Faker()

Now that our fake variable holds a Faker instance, getting simulated data is as simple as calling fake.name() or fake.city().

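The next step is to create a function that generates the CSV file. We need to import the built-in csv and random modules, plus Decimal from the decimal module. The function writes to ./files/invoices.csv, so the files directory must already exist.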

faker_test.py

import csv
import random
from decimal import Decimal

...

def create_csv_file():
    # The ./files directory must already exist; open() will not create it.
    with open('./files/invoices.csv', 'w', newline='') as csvfile:
        fieldnames = ['first_name', 'last_name', 'email', 'product_id', 'qty',
                      'amount', 'description', 'address', 'city', 'state',
                      'country']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        for _ in range(RECORD_COUNT):
            writer.writerow(
                {
                    # fake.name() returns a full name, so use the
                    # dedicated first/last name methods here.
                    'first_name': fake.first_name(),
                    'last_name': fake.last_name(),
                    'email': fake.email(),
                    'product_id': fake.random_int(min=100, max=199),
                    'qty': fake.random_int(min=1, max=9),
                    # Random amount between 5.00 and 99.99
                    'amount': float(Decimal(random.randrange(500, 10000))/100),
                    'description': fake.sentence(),
                    'address': fake.street_address(),
                    'city': fake.city(),
                    'state': fake.state(),
                    'country': fake.country()
                }
            )

Note: for numbers, Faker offers a random_int method. We can also use the random module, as is done for the amount. Either one will do.
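As a variation on the amount field, the sketch below keeps the value as a Decimal with exactly two decimal places instead of converting it to float. This is an alternative, not what the demo does. The random_amount helper, its parameter names, and the seed 42 are hypothetical.

```python
import random
from decimal import Decimal


def random_amount(rng, low_cents=500, high_cents=10000):
    """Return a Decimal amount with exactly two decimal places."""
    # randrange excludes the upper bound, so the largest value is 99.99.
    cents = rng.randrange(low_cents, high_cents)
    return (Decimal(cents) / 100).quantize(Decimal('0.01'))


rng = random.Random(42)  # a seeded generator keeps the sample repeatable
amounts = [random_amount(rng) for _ in range(3)]
print([str(amount) for amount in amounts])
```

The csv writer converts non-string values with str(), so a Decimal such as 5.00 would be written with both decimal places intact.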

In this next step, we will define a get_totals function. It loops through the file and adds up the qty and amount columns.

def get_totals():
    qty_total = 0
    amount_total = 0
    with open('./files/invoices.csv', 'r', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        next(reader, None)  # skip the header row
        for row in reader:
            qty_total += int(row[4])       # 'qty' column
            amount_total += float(row[5])  # 'amount' column
    return qty_total, amount_total

To keep things simple, we will use the time module to record how long each task takes: creating the file and then computing the totals. We can go ahead and add time to our list of imports.

import csv
import random
from time import time
from decimal import Decimal
from faker import Faker

Inside the if __name__ == '__main__': block, we can keep track of the time. The elapsed variable holds how long each task took, in seconds.

if __name__ == '__main__':
    start = time()
    create_csv_file()
    elapsed = time() - start
    print('created csv file time: {}'.format(elapsed))

    start = time()
    qty_total, amount_total = get_totals()
    elapsed = time() - start
    print('got totals time: {}'.format(elapsed))

    print('qty: {}'.format(qty_total))
    print('amount: {}'.format(amount_total))
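The start/elapsed pattern above repeats for each task. As an optional variation, the sketch below wraps it in a small context manager and uses time.perf_counter, a monotonic clock intended for measuring short durations. The timed helper, the timings dict, and the sample workload are hypothetical and are not part of the demo.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results):
    """Store the elapsed seconds for the with-block under results[label]."""
    start = perf_counter()
    try:
        yield
    finally:
        # Record the time even if the block raises an exception.
        results[label] = perf_counter() - start


timings = {}
with timed('sum of squares', timings):
    total = sum(i * i for i in range(100_000))

print('sum of squares time: {}'.format(timings['sum of squares']))
```

In the demo, the same helper could wrap the calls to create_csv_file() and get_totals().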

For 100,000 records, creating the CSV file took, by far, the most time:

created csv file time: 140.39420413970947
got totals time: 0.5633819103240967
qty: 497770
amount: 5255095.880000033
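The trailing digits in the amount total are a floating-point rounding artifact from adding many float values. If exact cents matter, one option is to accumulate with Decimal instead. The sketch below assumes a tiny in-memory sample rather than the real invoices file; get_totals_exact and SAMPLE are hypothetical names.

```python
import csv
import io
from decimal import Decimal

# A small stand-in for invoices.csv with only the columns needed here.
SAMPLE = 'first_name,qty,amount\nAnn,2,10.10\nBob,3,20.20\nCy,1,0.30\n'


def get_totals_exact(csvfile):
    """Sum qty as int and amount as Decimal to avoid float rounding."""
    qty_total = 0
    amount_total = Decimal('0')
    # DictReader consumes the header row and lets us look up columns by name.
    for row in csv.DictReader(csvfile):
        qty_total += int(row['qty'])
        amount_total += Decimal(row['amount'])
    return qty_total, amount_total


qty_total, amount_total = get_totals_exact(io.StringIO(SAMPLE))
print('qty: {}'.format(qty_total))        # qty: 6
print('amount: {}'.format(amount_total))  # amount: 30.60
```

Decimal arithmetic is slower than float arithmetic, so this trades some speed for exact totals.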

In more complex applications, Faker can be a valuable tool for finding bottlenecks and stress testing. It is also useful in unit tests, where it lets you generate test data with very little code.

One final note: the Faker library groups its generator methods into collections called “Providers”. Depending on the kind of data you’re looking for, one of these might be a good fit. They range from person profiles to addresses to credit card info. You can find more here: https://faker.readthedocs.io/en/master/providers.html.

