Category Archives: Code

Fast-ish Load Data to SQL from Databricks

I saw a Reddit thread last week from someone who needed to serve Delta table data at very low latency, which is typical of OLTP applications. Sometimes data stored in your lake needs to be copied to an RDBMS to provide the speed that mobile or desktop applications need. I have run into this problem repeatedly over the past few years; before things like lakehouse applications, there was no easy way to get that speed. Often, we found ourselves copying small sets of data and merging them into target tables, but other times we needed to copy huge sets of data.

Continue reading →

Poor Man’s Data Lineage and Classification in Power BI with Sankey Charts

There’s a huge need in any data-driven organization to know where data is coming from, how it got there, and what its other characteristics are. There are a variety of tools at our disposal that do a great job of this, and in most cases do it automagically. But what if you didn’t want to pay for Purview or Unity Catalog and just needed something to get the job done, or a proof of concept that shows business value and leads to adopting one of these tools? Well, you might not know it, but Power BI has a free marketplace item that is great for that: Sankey Charts.

Continue reading →

My Master Data Management and Data Quality Journey

When I first relocated to Colorado, I worked for a global company in a local role. During my almost five-year stay there, I grew into a global role with oversight of all things data coming from and going into their ERP system. I helped with 6 regional implementations into a global JD Edwards system and worked very closely with the brilliant Jiri Svoboda. At the time he held the position of Global Master Data Manager and I held the position of Global ERP Data Architect Manager. We were at the beginning of the company’s digital transformation and were tasked with modernizing the somewhat manual Data Migration and Master Data Management (MDM) processes. We shopped around various Gartner leaders in data management solutions. We had demos and POCs from IBM, Informatica, and some boutique data management vendors. We landed on Informatica because they were, and still are, light-years ahead of others in the world of MDM. What we learned was that Master Data Management, Data Governance, and analytics on data quality are super important, but tools will only get you so far. In this blog post, I’ll share my perspective on what worked for us and what I would do differently if I had to do it again.

Continue reading →

SQL Managed Instance Push to Databricks Delta Live Tables via CETAS and APIs

Let’s face it, a lot of a data engineer’s time is spent waiting to see if things executed as expected or for data to be refreshed. We write pipelines, buy expensive replication software, or sometimes manually move files (I hope we aren’t still doing that in this day and age), and in the end all of this has a cost associated with it when working in a cloud environment. In the case of Databricks jobs, we often find ourselves creating clusters just to move data, where the cluster lies dormant for most of the extraction. In my eyes, that’s wasteful and could probably be improved upon.

Continue reading →

Conferences Conferences Conferences

I feel as if the end of the year is always conference season for me because my summers are usually jam-packed with travel, since that is the only time of year my wife has off. I missed the Databricks conference in San Francisco this year because of that, but made up for it in the last quarter by attending various other events and even a two-day “Databricks World Tour” event in New York City.

Continue reading →

Deploy Non-Notebook Workspace Files to Databricks Workspaces with PowerShell

Databricks recently announced general availability of their “New Files Experience for Databricks Workspace”. It allows you to store and interact with non-notebook files alongside your notebooks. This is a very powerful feature when it comes to code modularization and reuse as well as environment configuration. The catch is that getting these files to a non-git-enabled environment can be tricky.

Continue reading →

Track Delta Table History with T-SQL

Delta tables are Databricks’ method of choice for storing data in the Lakehouse. They are useful when you need to persist dataframes to disk for speed, for later use, or to apply updates. In some cases you need to track, from outside of Databricks, how often a Delta table is updated.

Continue reading →

Foosball Scoring with Raspberry Pi

If you know me, you know that besides corgis I have another passion, and that’s foosball. I might not be tournament-level good, but I can keep up with my friends and in the bar scene. I had an old sporting-goods-store-level table when I was in high school and got rid of it when I moved out of my folks’ house. I swore I’d never buy another one because no one would ever come over and play. Every once in a while, during my adult life, I’d see a nice one pop up on Craigslist for a good price, and I always passed up the chance of getting a new one. Until one day when I saw a Tornado going for around 400 bucks. I couldn’t pass up the deal. A few weeks after getting it set up in our sunroom, I was going through some boxes and found an old Raspberry Pi that I used to use for playing emulated games. I thought to myself “what would be a good project to use this on?” and then it dawned on me. I should automate a scoreboard for the foosball table.

Continue reading →

DBTA Data Summit Boston

I’d never heard of the DBTA Data Summit until this year. Apparently this is the 9th one, so you’d think I would have heard of it by now, seeing that I’ve been in the industry for longer than it has existed. I usually attend the bigger conventions each year, but this year, because I had booked vacations that I’d much rather be on, I couldn’t attend the usual PASS, Microsoft, or Databricks conferences. I was really excited to see and hear about the developments in the world of Spark in San Francisco, but I wouldn’t be able to make it there this year. After a quick Google search, Data Summit seemed to be my only choice. So off I went to Boston…

Continue reading →

The Most Important Tool in Any Data Engineer’s Wheelhouse

There are a lot of productivity tools out there that can help accelerate development, help organize things better, or even speed up processes. But there’s one tool that I go to more often than not that very few developers know about…

Continue reading →

Corgis And Code

Adventures in Data Engineering And Dogs
