Tag Archives: Databricks

Fast-ish Load Data to SQL from Databricks

I saw a Reddit thread last week about someone’s issue with having to serve Delta table data at very low latency. This is typical of OLTP applications. Sometimes data stored in your lake needs to be copied to an RDBMS to provide the speed needed by mobile or desktop applications. This is a common problem I have faced over the years; before the lakehouse came along, there was no easy way to achieve the speed that was needed. Often we found ourselves copying small sets of data and merging them into target tables, but other times we needed to copy huge sets of data.
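
As a rough baseline for what the post improves upon, a plain Spark JDBC write from a Delta table into SQL looks something like the sketch below. The server, table, and secret names are placeholders, and the post goes well beyond this generic approach.

```python
# Rough baseline: copy a Delta table to SQL Server with Spark's generic JDBC writer.
# Server, table, and secret names here are placeholders, not from the post.
df = spark.read.table("silver.customer_orders")   # hypothetical Delta table

(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=app_db")
   .option("dbtable", "dbo.customer_orders_staging")
   .option("user", dbutils.secrets.get("app-scope", "sql-user"))
   .option("password", dbutils.secrets.get("app-scope", "sql-password"))
   .option("batchsize", "10000")     # larger batches mean fewer round trips
   .option("numPartitions", "8")     # parallel connections into SQL
   .mode("append")
   .save())
```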

Continue reading

SQL Managed Instance Push to Databricks Delta Live Tables via CETAS and APIs

Let’s face it, a lot of a data engineer’s time is spent waiting to see whether things executed as expected or for data to be refreshed. We write pipelines, buy expensive replication software, or sometimes move files manually (I hope we aren’t still doing that in this day and age), and in the end all of this has a cost associated with it when working in a cloud environment. In the case of Databricks jobs, we often find ourselves creating clusters just to move data, where the cluster sits dormant for most of the extraction. In my eyes, that’s wasteful and could probably be improved upon.
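
The post covers the CETAS export and the API plumbing in detail; just to give a flavour of the API side, starting a Delta Live Tables pipeline update from outside Databricks is roughly a single REST call, as in this hypothetical sketch (the host, token, and pipeline ID are placeholders).

```python
# Hypothetical example: trigger a Delta Live Tables pipeline update via the
# Databricks REST API once the CETAS export has landed.
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace URL
token = "dapiXXXXXXXXXXXXXXXX"                                 # placeholder token
pipeline_id = "00000000-0000-0000-0000-000000000000"           # placeholder pipeline ID

resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={"full_refresh": False},
)
resp.raise_for_status()
print(resp.json())   # contains the update_id of the triggered run
```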

Continue reading

Conferences Conferences Conferences

I feel as if the end of the year is always conference season for me, because my summers are usually jam-packed with travel since that is the only time of year my wife has off. I missed the Databricks conference in San Francisco this year because of that, but made up for it in the last quarter by attending various other events and even a two-day “Databricks World Tour” event in New York City.

Continue reading

Deploy Non-Notebook Workspace Files to Databricks Workspaces with PowerShell

Databricks recently announced general availability of their “New Files Experience for Databricks Workspace”. It allows you to store and interact with non-notebook files alongside your notebooks. This is a very powerful feature when it comes to code modularization/reuse as well as environment configuration. The thing is that getting these files to a non-git-enabled environment can be tricky.
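
The post does this with PowerShell against the Workspace API; the sketch below shows, in Python for brevity, roughly the import call such a deploy script would wrap. The host, token, and file paths are placeholders.

```python
# Rough sketch of the Workspace API call a deploy script would wrap:
# import a non-notebook file into a workspace folder.
import base64
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"    # placeholder
token = "dapiXXXXXXXXXXXXXXXX"                                  # placeholder

with open("config/settings.json", "rb") as f:                   # hypothetical local file
    content = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Shared/my_project/config/settings.json",      # hypothetical target path
        "format": "AUTO",    # let the workspace decide whether it is a notebook or a plain file
        "content": content,
        "overwrite": True,
    },
)
resp.raise_for_status()
```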

Continue reading

Google Geocoding API with Spark

A couple of days ago while browsing Reddit, I came across someone asking if anyone had used the Google Geocoding API and how they went about doing it. Having recently done so, I offered my assistance, but also felt compelled to write up a blog post, because like most things Spark- or API-related, there isn’t much out there in terms of actual examples. So here’s my effort at sharing how we went about adding geocoding to our dataframes for addresses or lat/longs.
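
A stripped-down illustration of the idea is below: geocoding an address column with the Google Geocoding API inside a Python UDF. The column names and secret scope are made up, and the real implementation in the post handles batching, rate limiting, and errors far more carefully.

```python
# Stripped-down sketch: geocode an address column via the Google Geocoding API
# inside a Python UDF. Column names and the secret scope are placeholders.
import requests
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, DoubleType

api_key = dbutils.secrets.get("geo-scope", "google-api-key")   # placeholder secret

geo_schema = StructType([
    StructField("lat", DoubleType()),
    StructField("lng", DoubleType()),
])

@udf(geo_schema)
def geocode(address):
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": address, "key": api_key},
        timeout=10,
    )
    results = resp.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return (loc["lat"], loc["lng"])

df = spark.table("silver.customer_addresses")       # hypothetical source table
df = df.withColumn("geo", geocode("full_address"))  # adds a struct of lat/lng
```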

Continue reading

Near Real Time Ingestion with Databricks Delta

There are many approaches to NRT, and some might argue that there really isn’t a reporting need that warrants ingesting data at this rate. Sure, if streaming is available and you want to see what’s going on at the current moment, then a Lambda architecture might be what you are after. But what about those other use cases where a user just wants to see their data in a dimensional model as fast as possible? I typically push back on whether that’s even necessary, but sometimes the powers that be dictate that it happen. So what approach do you take?
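
The post weighs the options; purely as a point of reference, one common near-real-time pattern, not necessarily the one the post lands on, is incremental file ingestion into a Delta table on a short streaming trigger. Paths and table names below are placeholders.

```python
# One common NRT pattern (illustrative only): Auto Loader picks up new files
# and streams them into a Delta table on a short trigger.
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
       .load("/mnt/landing/orders/"))

(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .trigger(processingTime="1 minute")   # near real time, not true streaming latency
    .toTable("bronze.orders"))
```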

Continue reading

Unit Testing Databricks Notebooks – Part 4

Build Pipeline

In the last three posts we’ve covered the why and how of this approach. We’ve successfully built a notebook that can source its transformations from different databases and either export to a CSV file or write to Delta Lake. Now we need to incorporate both of these objects into a CI/CD process.
To better illustrate the process, this pipeline was built in the classic editor, but a YAML file is available in the GitHub repository for this project.

Continue reading

Unit Testing Databricks Notebooks – Part 3

In the last two posts, we’ve covered the overall approach to unit testing notebooks as well as a notebook structure that can source data from different databases, but how does the data get to those databases?

As I explained in the first post, we are using the medallion architecture. On normal runs, the query in the example would source its data from Silver zone delta tables, but during test runs it sources from a database called “unit_test_db”.

We need to create some sort of process that takes seed data and populates the tables in this database, and this is where databricks-connect and a Python project come into play.
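
As a rough sketch of what that seeding step could look like with classic databricks-connect, where the usual SparkSession builder returns a remote session, the Python project might load local seed files into unit_test_db like this. The file and table names are illustrative, not the project’s actual layout.

```python
# Hypothetical seeding script: read local seed CSVs and publish them as tables
# in unit_test_db through a databricks-connect session.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # remote session via databricks-connect

spark.sql("CREATE DATABASE IF NOT EXISTS unit_test_db")

seed_files = {
    "customers": "tests/seed/customers.csv",   # hypothetical local seed files
    "orders": "tests/seed/orders.csv",
}

for table, path in seed_files.items():
    local_df = pd.read_csv(path)               # read the seed data locally
    (spark.createDataFrame(local_df)           # ship it to the cluster
          .write
          .mode("overwrite")
          .saveAsTable(f"unit_test_db.{table}"))
```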

Continue reading

Unit Testing Databricks Notebooks – Part 2

Notebook Structure

Now that we have an understanding of the motivators behind putting together a framework like this, we can get into how it’s implemented. There are two parts to this implementation: the notebooks and the unit test project. In this section we will go over how to structure a notebook so it lends itself nicely to unit testing with seed and assertion data.

In order to accomplish the task of unit testing, we need to be able to use seed data in our Spark SQL commands. To do this, we set the source database for the Spark SQL commands based on a configuration.
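
As an illustration of that configuration switch, not the post’s exact code, a notebook parameter can select the source database that the Spark SQL query references; the table and column names below are made up.

```python
# Illustrative notebook cell: a parameter picks the source database, so tests
# can point the same query at unit_test_db instead of the silver zone.
dbutils.widgets.text("source_db", "silver")   # defaults to the real zone
source_db = dbutils.widgets.get("source_db")  # tests pass "unit_test_db"

df = spark.sql(f"""
    SELECT customer_id, SUM(order_total) AS lifetime_value
    FROM {source_db}.orders                   -- hypothetical table
    GROUP BY customer_id
""")
```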

Continue reading