Tag Archives: DeltaLake

SQL Managed Instance Push to Databricks Delta Live Tables via CETAS and APIs

Let’s face it, a lot of a data engineer’s time is spent waiting to see if things executed as expected or for data to be refreshed; We write pipelines, buy expensive replication software, or sometime manually move files (I hope we still aren’t in this day and age), and in the end all of this has a cost associated with it when working in a cloud environment. In the case of Databricks jobs, we often find ourselves creating clusters just to move data, where the cluster lays dormant for the most part during these extractions. In my eyes, that’s wasteful and could probably be improved upon.

Continue reading

Querying Delta Lake with T-SQL via Synapse Serverless and Managed Instance

In this blog post I try to demystify how to setup an environment that utilizes Azure Synapse Serverless, Delta Lake on ADLS Gen2, and SQL Managed Instances to enable you to query your delta lake with T-SQL as if it were any other SQL source in order to accomplish something like polybase.

Continue reading

Near Real Time Ingestion with Databricks Delta

There are many approaches to NRT and some might argue that there really isn’t a reporting need that warrants the ingestion of data at this rate. Sure, if there’s streaming available and you want to see what’s going on at the current moment, then a Lambda architecture might be what you are after. But what about those other use cases when a user just wants to see their data in a dimensional model as fast as possible? I typically argue against why that’s even necessary, but sometimes the powers at be dictate it to happen. So what approach do you take?

Continue reading

Unit Testing Databricks Notebooks – Part 4

Build Pipeline

In the last three posts we’ve covered the why and how of this approach. We’ve successfully built a notebook that can use different databases to conduct transformations from and either export to a CSV file or write to Delta Lake. Now, we need to incorporate both of these objects into a CI/CD process.
To easier illustrate the process this pipeline was built in the classis editor but a YAML file is available in the GitHub repository for this project.

Continue reading

Unit Testing Databricks Notebooks – Part 3

In the last two sections, we’ve covered the overall approach to the problem of unit testing notebooks as well as a notebook structure that enables source data from different databases, but how does the data get to those databases?

As I explained in the first post, we are using the medallion architecture. On normal runs, the query in the example would source its data from Silver zone delta tables, but during test runs it sources from a database called “unit_test_db”.

We need to create some sort of process that takes seed data and populates the tables in this database, and this is where databricks-connect and a Python project come into play.

Continue reading

Unit Testing Databricks Notebooks – Part 2

Notebook Structure

Now that we have an understanding of the motivators behind putting together a framework like this we can get into how it’s implemented. There are two parts to this implementation: the notebooks and the unit test project. In this section we will go over how to structure a notebook so it lends itself nicely to unit testing with seed and assertion data.

In order to accomplish the task of unit testing, we need to be able to use seed data in our Spark SQL commands. To do this we set the source database for the Spark SQL commands based on a configuration.

Continue reading

Unit Testing Databricks Notebooks – Part 1

Background

There’s not much out there on the subject of how to unit test Databricks notebooks when it comes to using them to create a Kimball data warehouse or a data lake house. This is because of a couple reasons. One: most people who have historically worked in Business Intelligence aren’t in the habit of writing automated unit testing for their code, or two: unit testing of notebooks seems to be an impossible task to accomplish if you keep your ETL code entirely inside of Databricks notebooks. Some might argue that notebooks aren’t enterprise grade ETL in terms of SDLC and CI/CD capabilities. I agree. If you’ve got a team of software developers who are well versed in OOP then you should be building Python wheels that are deployed and ran on your Databricks clusters. But what happens when you’ve got a team who is only familiar with tools such as SSIS or Informatica, and the extent of their development experience begins and ends with the Script Component and Script Task in SSIS?

Continue reading