
Google Geocoding API with Spark

A couple of days ago while browsing Reddit, I came across someone asking whether anyone had used the Google Geocoding API and how they went about it. Having recently done exactly that, I offered my assistance, but also felt compelled to write up a blog post, because like most things Spark or API related, there isn't much out there in terms of actual examples. So here's my effort at sharing how we went about adding geocoding to our dataframes for addresses or lat/longs.
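For context, one common way to do this is to wrap the HTTP call in a Spark UDF. The sketch below is a minimal, hypothetical version of that idea: the endpoint is Google's real Geocoding API, but the column name, API key placeholder, and result schema are assumptions for illustration, not the exact code from our project.

```python
# Minimal sketch: geocoding a DataFrame of addresses via a Spark UDF.
# GOOGLE API KEY and the "address" column are illustrative assumptions.
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, DoubleType

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
API_KEY = "YOUR_GOOGLE_API_KEY"  # placeholder

latlng_schema = StructType([
    StructField("lat", DoubleType()),
    StructField("lng", DoubleType()),
])

@udf(returnType=latlng_schema)
def geocode(address):
    # One HTTP call per row; fine for small batches, but consider
    # caching or batching before running this at scale.
    resp = requests.get(GEOCODE_URL, params={"address": address, "key": API_KEY})
    results = resp.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return (loc["lat"], loc["lng"])

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("1600 Amphitheatre Pkwy, Mountain View, CA",)], ["address"]
)
df.withColumn("latlng", geocode("address")).show(truncate=False)
```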

Continue reading

Near Real Time Ingestion with Databricks Delta

There are many approaches to NRT, and some might argue that there really isn't a reporting need that warrants ingesting data at this rate. Sure, if streaming is available and you want to see what's going on at the current moment, then a Lambda architecture might be what you're after. But what about those other use cases where a user just wants to see their data in a dimensional model as fast as possible? I typically argue against why that's even necessary, but sometimes the powers that be dictate that it happen. So what approach do you take?
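Before diving in, the simplest Delta-based NRT pattern worth having in mind is a Structured Streaming job that micro-batches into a Delta table. This is only a hedged sketch of that baseline; the paths, table name, and one-minute trigger are assumptions for illustration.

```python
# Hedged sketch: near-real-time ingestion into Delta with Structured
# Streaming and Databricks Auto Loader. Paths and names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(spark.readStream
    .format("cloudFiles")                            # Databricks Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/orders/")
    .load("/mnt/raw/orders/")                        # assumed landing path
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders/")
    .trigger(processingTime="1 minute")              # micro-batches ~= near real time
    .toTable("bronze.orders"))
```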

Continue reading

Unit Testing Databricks Notebooks – Part 4

Build Pipeline

In the last three posts we've covered the why and how of this approach. We've successfully built a notebook that can source its transformations from different databases and either export to a CSV file or write to Delta Lake. Now we need to incorporate both of these objects into a CI/CD process.

To better illustrate the process, this pipeline was built in the classic editor, but a YAML file is available in the GitHub repository for this project.
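For readers who prefer YAML, a pipeline along these lines might look roughly like the outline below. This is an assumption-laden sketch (the step names, Python version, and test paths are mine, not taken from the repository); the actual file in the repo is the authoritative version.

```yaml
# Hypothetical azure-pipelines.yml outline; step names and paths are assumed.
trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.8'

  - script: pip install -r requirements.txt
    displayName: 'Install dependencies (databricks-connect, pytest)'

  - script: pytest tests --junitxml=test-results.xml
    displayName: 'Run notebook unit tests'

  - task: PublishTestResults@2
    inputs:
      testResultsFiles: 'test-results.xml'
```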

Continue reading

Unit Testing Databricks Notebooks – Part 3

In the last two posts we've covered the overall approach to the problem of unit testing notebooks, as well as a notebook structure that lets us source data from different databases. But how does the data get into those databases?

As I explained in the first post, we are using the medallion architecture. On normal runs, the query in the example would source its data from Silver zone Delta tables, but during test runs it sources from a database called "unit_test_db".

We need to create some sort of process that takes seed data and populates the tables in this database, and this is where databricks-connect and a Python project come into play.
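To make the idea concrete, here is a minimal sketch of what such a seeding step could look like, assuming databricks-connect is configured and the seed data lives in local CSV files. The file path, table name, and use of pandas are illustrative assumptions, not the project's exact code.

```python
# Hypothetical seed loader run via databricks-connect; names are assumed.
import pandas as pd
from pyspark.sql import SparkSession

# With databricks-connect configured, this session points at the remote cluster.
spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS unit_test_db")

# Parse the seed file locally, then ship it to the cluster as a DataFrame.
seed_pdf = pd.read_csv("seed_data/customers.csv")
spark.createDataFrame(seed_pdf) \
    .write.mode("overwrite") \
    .saveAsTable("unit_test_db.customers")
```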

Continue reading

Unit Testing Databricks Notebooks – Part 2

Notebook Structure

Now that we have an understanding of the motivators behind putting together a framework like this, we can get into how it's implemented. There are two parts to this implementation: the notebooks and the unit test project. In this post we will go over how to structure a notebook so it lends itself nicely to unit testing with seed and assertion data.

In order to unit test, we need to be able to use seed data in our Spark SQL commands. To do this, we set the source database for the Spark SQL commands based on a configuration.
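In a Databricks notebook, one natural way to express that configuration is a widget whose value defaults to the normal source database. The snippet below is a minimal sketch of the idea; the widget name, default value, and query are illustrative assumptions rather than the post's exact code.

```python
# Hypothetical parameterization of the source database via a notebook widget.
# dbutils and spark are ambient objects inside a Databricks notebook.
dbutils.widgets.text("source_db", "silver")      # test runs pass "unit_test_db"
source_db = dbutils.widgets.get("source_db")

# The same Spark SQL runs against either the real Silver zone or the seed data.
df = spark.sql(f"SELECT customer_id, order_total FROM {source_db}.orders")
```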

Continue reading

Unit Testing Databricks Notebooks – Part 1

Background

There’s not much out there on the subject of how to unit test Databricks notebooks when it comes to using them to create a Kimball data warehouse or a data lake house. This is because of a couple reasons. One: most people who have historically worked in Business Intelligence aren’t in the habit of writing automated unit testing for their code, or two: unit testing of notebooks seems to be an impossible task to accomplish if you keep your ETL code entirely inside of Databricks notebooks. Some might argue that notebooks aren’t enterprise grade ETL in terms of SDLC and CI/CD capabilities. I agree. If you’ve got a team of software developers who are well versed in OOP then you should be building Python wheels that are deployed and ran on your Databricks clusters. But what happens when you’ve got a team who is only familiar with tools such as SSIS or Informatica, and the extent of their development experience begins and ends with the Script Component and Script Task in SSIS?

Continue reading

Hello world!

Coming soon…

Hi! My name is Charles Rinaldini and I’ve finally decided to start blogging about my adventures in the data world. I’ve been a data enthusiast and business intelligence engineer for more than a decade. I’m passionate about my work, learning, and especially my dog Andre the Giant.

Stay tuned for more updates as I try to figure out what’s most useful to share with everyone!