Unit Testing Databricks Notebooks – Part 2

Notebook Structure

Now that we understand the motivation for a framework like this, we can get into how it is implemented. The implementation has two parts: the notebooks and the unit test project. This section covers how to structure a notebook so that it lends itself to unit testing with seed and assertion data.

To unit test a notebook, we need our Spark SQL commands to be able to read seed data. To do this, we set the source database for the Spark SQL commands based on a configuration value.

In the following examples we use the JD Edwards system to create a very simple dimension of subsidiaries. The subsidiary number is joined to the user-defined codes table to get the name of the subsidiary.

Dynamically setting Database in Spark SQL

Notice that the database prefix is held in a variable, “jde_dwDbName”. Its value depends on where we choose to read our data from.

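# `config` is assumed to be set earlier in the notebook (for example, by a notebook parameter).
if config == "Test":
  # Read seed data from the unit test database.
  jde_dwDbName = "unit_test_db.jde_"
else:
  # Read from the real data warehouse database.
  jde_dwDbName = "jde_dw."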
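As a rough illustration of how such a prefix might be used, the sketch below builds a Spark SQL string from it. The table and column names (subsidiary, user_defined_codes, subsidiary_number, code, description) and both function names are hypothetical placeholders, not the names used in the actual notebook or in JD Edwards.

```python
def get_db_prefix(config: str) -> str:
    """Return the database prefix for the given configuration."""
    # Same rule as the notebook snippet: "Test" reads seed data,
    # anything else reads from the real database.
    return "unit_test_db.jde_" if config == "Test" else "jde_dw."


def build_subsidiary_query(db_prefix: str) -> str:
    """Build the dimension query against whichever source the prefix selects.

    Table and column names are hypothetical and only illustrate the pattern.
    """
    return (
        "SELECT sub.subsidiary_number, udc.description AS subsidiary_name "
        f"FROM {db_prefix}subsidiary AS sub "
        f"LEFT JOIN {db_prefix}user_defined_codes AS udc "
        "ON sub.subsidiary_number = udc.code"
    )


if __name__ == "__main__":
    # Print the query each configuration would produce.
    for cfg in ("Test", "Prod"):
        print(cfg, "->", build_subsidiary_query(get_db_prefix(cfg)))
```

In a notebook, the resulting string could be passed to spark.sql(...) and the result registered with createOrReplaceTempView(...). Because only the prefix changes, the same query text runs against seed data and real data alike.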

From here we decide, also based on configuration, what to do with the temp view created from the Spark SQL command.

Writing to csv or delta depending on config

Regardless of the config, a file is either updated or created. If we have chosen “Test”, we write the file to the test zone of our data lake; otherwise we write to our gold zone. The notebook also includes a helper function that writes our files with friendly names.
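The branching described above could be sketched roughly as follows. This is not the notebook's actual code: the mount paths (/mnt/datalake/test and /mnt/datalake/gold), the function names, and the assumption that “Test” produces CSV while everything else produces Delta are illustrative guesses based on the heading. The friendly-name helper is not reproduced here.

```python
def resolve_output(config: str, table_name: str) -> dict:
    """Pick the output format, path, and writer options for a configuration.

    The paths are hypothetical; substitute your own data lake locations.
    """
    if config == "Test":
        # Test runs write CSV to the test zone so results can be compared
        # with assertion data.
        return {
            "format": "csv",
            "path": f"/mnt/datalake/test/{table_name}",
            "options": {"header": "true"},
        }
    # Any other configuration writes a Delta table to the gold zone.
    return {
        "format": "delta",
        "path": f"/mnt/datalake/gold/{table_name}",
        "options": {},
    }


def write_view(spark, view_name: str, config: str, table_name: str) -> dict:
    """Write the temp view to the location chosen by resolve_output.

    `spark` is the SparkSession that a Databricks notebook provides.
    """
    target = resolve_output(config, table_name)
    (
        spark.table(view_name)
        .write.format(target["format"])
        .options(**target["options"])
        .mode("overwrite")  # create the output, or replace it if it exists
        .save(target["path"])
    )
    return target
```

Keeping the path and format decision in a small pure function means it can be checked without a cluster, while write_view stays a thin wrapper around the Spark writer.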

Complete Notebook

The complete notebook can be found here:

https://github.com/CharlesRinaldini/Databricks-UnitTestNotebooks/blob/master/notebooks/TRAN_Simple_Test.py

*Note: This notebook only creates a Delta Lake table or a test output, to illustrate how the destination can be controlled via a configuration variable. It would have to be expanded to merge into an existing table or do anything else.

Continue to Part 3
