Let’s face it: a lot of a data engineer’s time is spent waiting to see whether things executed as expected or for data to be refreshed. We write pipelines, buy expensive replication software, or sometimes manually move files (I hope we aren’t still doing that in this day and age), and in a cloud environment all of this has a cost associated with it. In the case of Databricks jobs, we often find ourselves creating clusters just to move data, and those clusters sit mostly idle during the extractions. In my eyes, that’s wasteful and could probably be improved upon.
Tag Archives: Databricks
Conferences Conferences Conferences
I feel as if the end of the year is always conference season for me, because my summers are usually jam-packed with travel since that is the only time of year my wife has off. I missed the Databricks conference in San Francisco this year because of that, but made up for it in the last quarter by attending various other events and even a two-day “Databricks World Tour” event in New York City.
Deploy Non-Notebook Workspace Files to Databricks Workspaces with PowerShell
Databricks recently announced general availability of their “New Files Experience for Databricks Workspace”. It allows you to store and interact with non-notebook files alongside your notebooks. This is a very powerful feature when it comes to code modularization/reuse as well as environment configuration. The thing is, getting these files to a non-git-enabled environment can be tricky.
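The post itself does the deployment with PowerShell; purely as a rough illustration of the underlying call, here is a Python sketch of pushing a local file into a workspace folder through the Workspace Import API. The host, token, paths, and format value are my assumptions, not the post’s script.

```python
# Rough Python sketch of importing a non-notebook file into the workspace.
# Host, token, and paths are placeholders; the post's actual approach uses PowerShell.
import base64
import requests

DATABRICKS_HOST = "https://adb-0000000000000000.0.azuredatabricks.net"  # hypothetical workspace URL
TOKEN = "dapi-XXXX"  # personal access token; store securely, never hard-code in real projects

def upload_workspace_file(local_path: str, workspace_path: str) -> None:
    """Import a local file into the workspace as a non-notebook workspace file."""
    with open(local_path, "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")

    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/workspace/import",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "path": workspace_path,   # e.g. a shared config folder
            "format": "AUTO",         # non-notebook extensions are kept as plain workspace files
            "content": content,
            "overwrite": True,
        },
        timeout=30,
    )
    resp.raise_for_status()

upload_workspace_file("config/app.yaml", "/Shared/config/app.yaml")  # hypothetical paths
```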
Track Delta Table History with T-SQL
Delta tables are Databricks’ method of choice when storing data in the Lakehouse. They are useful when you need to persist dataframes to disk for speed or later use, or to apply updates. In some cases you need to track, from outside of Databricks, how often a Delta table is updated.
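For orientation, this is roughly what that history looks like from inside a Databricks notebook (where `spark` is already defined); the post’s angle is getting at the same information from T-SQL outside the workspace. The table name below is a made-up example.

```python
# Inspect a Delta table's commit history from inside a notebook; the table name is hypothetical.
from delta.tables import DeltaTable

history_df = (
    DeltaTable.forName(spark, "silver.customer")  # hypothetical table
    .history(20)                                  # last 20 commits
    .select("version", "timestamp", "operation", "operationMetrics")
)
history_df.show(truncate=False)
```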
Google Geocoding API with Spark
A couple of days ago while browsing Reddit, I came across someone asking if anyone had used the Google Geocoding API and how they went about doing it. Having recently done it, I offered my assistance, but also felt compelled to write up a blog post, because of course, like most things Spark- or API-related, there isn’t much out there in terms of actual examples. So here’s my effort at sharing how we went about adding geocoding to our dataframes for addresses or lat/longs.
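As a very stripped-down sketch of that idea (assuming a dataframe `df` with an `address` column, and leaving out the batching, caching, and rate limiting you would want against a paid API), a UDF can wrap the Geocoding API call and return a lat/long struct:

```python
# Wrap a Google Geocoding API call in a UDF; column names and API key are placeholders.
import requests
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType

API_KEY = "YOUR_GOOGLE_API_KEY"  # placeholder

latlng_schema = StructType([
    StructField("lat", DoubleType()),
    StructField("lng", DoubleType()),
])

@F.udf(returnType=latlng_schema)
def geocode(address):
    """Call the Geocoding API for one address and return (lat, lng), or None if no match."""
    if not address:
        return None
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": address, "key": API_KEY},
        timeout=10,
    )
    results = resp.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return (loc["lat"], loc["lng"])

df = (
    df.withColumn("geo", geocode(F.col("address")))
      .withColumn("latitude", F.col("geo.lat"))
      .withColumn("longitude", F.col("geo.lng"))
      .drop("geo")
)
```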
Near Real Time Ingestion with Databricks Delta
There are many approaches to NRT, and some might argue that there really isn’t a reporting need that warrants the ingestion of data at this rate. Sure, if streaming is available and you want to see what’s going on at the current moment, then a Lambda architecture might be what you are after. But what about those other use cases where a user just wants to see their data in a dimensional model as fast as possible? I typically argue against why that’s even necessary, but sometimes the powers that be dictate it to happen. So what approach do you take?
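One of the usual options in this space, just to frame the trade-off, is Structured Streaming (here with Auto Loader) landing files into a Delta table as they arrive. This is a generic sketch with placeholder paths and a placeholder trigger interval, not necessarily the approach the post settles on.

```python
# Generic near-real-time ingestion sketch: Auto Loader streaming into a Delta table.
# Paths, schema location, and trigger interval are placeholders.
raw_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/orders")  # hypothetical path
    .load("/mnt/landing/orders")                                          # hypothetical path
)

(
    raw_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/orders")      # hypothetical path
    .trigger(processingTime="1 minute")                                   # micro-batches every minute
    .toTable("bronze.orders")                                             # hypothetical target table
)
```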
Unit Testing Databricks Notebooks – Part 4
Build Pipeline
In the last three posts we’ve covered the why and how of this approach. We’ve successfully built a notebook that can source its transformations from different databases and either export to a CSV file or write to Delta Lake. Now, we need to incorporate both of these objects into a CI/CD process.
To better illustrate the process, this pipeline was built in the classic editor, but a YAML file is available in the GitHub repository for this project.
Unit Testing Databricks Notebooks – Part 3
In the last two sections, we’ve covered the overall approach to the problem of unit testing notebooks as well as a notebook structure that enables sourcing data from different databases, but how does the data get to those databases?
As I explained in the first post, we are using the medallion architecture. On normal runs, the query in the example would source its data from Silver zone delta tables, but during test runs it sources from a database called “unit_test_db”.
We need to create some sort of process that takes seed data and populates the tables in this database, and this is where databricks-connect and a Python project come into play.
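A minimal sketch of that seeding step, under the assumption that the seed data lives as local CSV files in the test project (the folder layout and session setup here are my guesses, not the project’s actual code):

```python
# Seed unit_test_db from local CSV files using a databricks-connect SparkSession.
# Folder layout and file naming are assumptions; table names mirror the CSV file names.
from pathlib import Path
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # databricks-connect routes this to the remote cluster

spark.sql("CREATE DATABASE IF NOT EXISTS unit_test_db")

for csv_file in Path("tests/seed_data").glob("*.csv"):      # hypothetical seed folder
    seed_df = spark.createDataFrame(pd.read_csv(csv_file))  # read locally, ship to the cluster
    seed_df.write.mode("overwrite").saveAsTable(f"unit_test_db.{csv_file.stem}")
```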
Unit Testing Databricks Notebooks – Part 2
Notebook Structure
Now that we have an understanding of the motivators behind putting together a framework like this, we can get into how it’s implemented. There are two parts to this implementation: the notebooks and the unit test project. In this section we will go over how to structure a notebook so it lends itself nicely to unit testing with seed and assertion data.
In order to accomplish the task of unit testing, we need to be able to use seed data in our Spark SQL commands. To do this we set the source database for the Spark SQL commands based on a configuration.
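A minimal sketch of that pattern, assuming a notebook context where `dbutils` and `spark` exist and using a made-up widget name, table, and columns:

```python
# Parameterize the source database so test runs can point the same query at unit_test_db.
dbutils.widgets.text("source_db", "silver")      # normal runs default to the Silver zone
source_db = dbutils.widgets.get("source_db")     # test runs pass "unit_test_db" instead

customers_df = spark.sql(f"""
    SELECT customer_id, first_name, last_name    -- hypothetical columns
    FROM {source_db}.customers                   -- hypothetical table
""")
```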
Unit Testing Databricks Notebooks – Part 1
Background
There’s not much out there on the subject of how to unit test Databricks notebooks when it comes to using them to create a Kimball data warehouse or a data lakehouse. This is for a couple of reasons. One: most people who have historically worked in Business Intelligence aren’t in the habit of writing automated unit tests for their code. Two: unit testing of notebooks seems to be an impossible task to accomplish if you keep your ETL code entirely inside of Databricks notebooks. Some might argue that notebooks aren’t enterprise-grade ETL in terms of SDLC and CI/CD capabilities. I agree. If you’ve got a team of software developers who are well versed in OOP, then you should be building Python wheels that are deployed and run on your Databricks clusters. But what happens when you’ve got a team who is only familiar with tools such as SSIS or Informatica, and the extent of their development experience begins and ends with the Script Component and Script Task in SSIS?