There’s a huge need in any data-driven organization to know where data comes from, how it got there, and what other characteristics it has. There are a variety of tools at our disposal that do a great job of this, and in most cases do it automagically. But what if you don’t want to pay for Purview or Unity Catalog and just need something to get the job done, or a proof of concept that shows business value and leads into the adoption of one of those tools? Well, you might not know it, but Power BI has a free marketplace item that is great for exactly that: Sankey charts.
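To give a flavor of how little shaping this needs, here’s a minimal sketch (with made-up table names, not the actual data from the full post) of turning lineage metadata into the source/destination/weight rows a Sankey chart visual typically expects:

```python
# Minimal sketch: shape lineage facts into the three columns a Sankey visual
# needs. Table and column names below are hypothetical examples.
import pandas as pd

# Hypothetical lineage edges: which source feeds which target, and how many
# columns (or rows, or jobs) flow across that edge.
lineage = pd.DataFrame(
    [
        ("SalesDB.dbo.Orders", "ADLS.raw.orders", 42),
        ("ADLS.raw.orders", "Lakehouse.silver.orders", 42),
        ("Lakehouse.silver.orders", "PowerBI.SalesModel", 18),
    ],
    columns=["Source", "Destination", "Weight"],
)

# The Sankey visual just needs these three fields; load the CSV into Power BI
# and map Source, Destination, and Weight in the visual's field wells.
lineage.to_csv("lineage_edges.csv", index=False)
```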
My Master Data Management and Data Quality Journey
When I first relocated to Colorado I worked for a global company in a local role. During my almost five-year stay there, I grew into a global role with oversight over all things data coming from and going into their ERP system. I helped with six regional implementations into a global JD Edwards system and worked very closely with the brilliant Jiri Svoboda. At the time he held the position of Global Master Data Manager and I held the position of Global ERP Data Architect Manager. We were at the beginning of the company’s digital transformation and were tasked with taking the somewhat manual Data Migration and Master Data Management (MDM) processes and modernizing them. We shopped around various Gartner leaders in data management solutions. We had demos and POCs from IBM, Informatica, and some boutique data management vendors. We landed on Informatica because they were, and still are, light years ahead of the others in the world of MDM. What we learned was that Master Data Management, Data Governance, and analytics on data quality are super important, but tools will only get you so far. In this blog post, I’ll share my perspective on what worked for us and what I would do differently if I had to do it again.
SQL Managed Instance Push to Databricks Delta Live Tables via CETAS and APIs
Let’s face it, a lot of a data engineer’s time is spent waiting to see if things executed as expected or for data to be refreshed. We write pipelines, buy expensive replication software, or sometimes manually move files (I hope we aren’t still doing that in this day and age), and in the end all of this has a cost associated with it when working in a cloud environment. In the case of Databricks jobs, we often find ourselves creating clusters just to move data, and those clusters sit mostly idle during the extractions. In my eyes, that’s wasteful and could probably be improved upon.
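To illustrate the push side of the idea, here’s a rough sketch (placeholder host, token, and pipeline ID, not the exact code from the full post) of starting a Delta Live Tables pipeline update through the Databricks REST API once the extracted files have landed, instead of keeping an extraction cluster running:

```python
# Hedged sketch: after SQL MI lands files via CETAS, trigger a DLT pipeline
# update through the Databricks Pipelines REST API. All identifiers below are
# placeholders.
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapi-xxxxxxxx"               # placeholder personal access token
PIPELINE_ID = "your-dlt-pipeline-id"  # placeholder

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/pipelines/{PIPELINE_ID}/updates",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"full_refresh": False},  # incremental update of the DLT pipeline
    timeout=30,
)
resp.raise_for_status()
print("Started DLT update:", resp.json().get("update_id"))
```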
Conferences Conferences Conferences
I feel as if the end of the year is always conference season for me, because my summers are usually jam-packed with travel since that is the only time of year my wife has off. I missed the Databricks conference in San Francisco this year because of that, but made up for it in the last quarter by attending various other events and even a two-day “Databricks World Tour” event in New York City.
Deploy Non-Notebook Workspace Files to Databricks Workspaces with PowerShell
Databricks recently announced general availability of their “New Files Experience for Databricks Workspace”. It allows you to store and interact with non-notebook files alongside your notebooks. This is a very powerful feature when it comes to code modularization and reuse as well as environment configuration. The thing is, getting these files into a non-git-enabled environment can be tricky.
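The full post does this with PowerShell; as a rough illustration of the underlying REST call (with placeholder host, token, and paths), here is what pushing a non-notebook file through the Workspace Import API looks like:

```python
# Hedged sketch: import a non-notebook file (e.g. a config file or .py module)
# into a workspace path via the Workspace Import API. Host, token, and paths
# are placeholders.
import base64
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapi-xxxxxxxx"  # placeholder personal access token

with open("config/environment.json", "rb") as f:  # hypothetical local file
    payload = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "path": "/Shared/project/config/environment.json",  # workspace destination
        "format": "AUTO",      # lets Databricks store it as a workspace file
        "content": payload,    # base64-encoded file contents
        "overwrite": True,
    },
    timeout=30,
)
resp.raise_for_status()
```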
Track Delta Table History with T-SQL
Delta tables are Databricks’ format of choice for storing data in the Lakehouse. They are useful when you need to persist dataframes to disk for speed, for later use, or to apply updates. In some cases you need to track how often a Delta table is updated from outside of Databricks.
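For context, inside Databricks that update history is exposed with DESCRIBE HISTORY; the full post is about surfacing the same information with T-SQL from outside Databricks. A quick sketch of the in-Databricks view, with a placeholder table path:

```python
# Hedged sketch: each commit to a Delta table is recorded in its transaction
# log and can be read with DESCRIBE HISTORY. The storage path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

history = spark.sql(
    "DESCRIBE HISTORY delta.`abfss://lake@youraccount.dfs.core.windows.net/silver/orders`"
)
# Each row is one commit: version, timestamp, operation (WRITE, MERGE, ...), etc.
history.select("version", "timestamp", "operation").show(truncate=False)
```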
Foosball Scoring with Raspberry Pi
If you know me, you know that besides corgis I have another passion, and that’s foosball. I might not be tournament-level good, but I can keep up with my friends and in the bar scene. I had an old sporting-goods-store-level table when I was in high school and got rid of it when I moved out of my folks’ house. I swore I’d never buy another one because no one would ever come over and play, and every once in a while during my adult life I’d see a nice one pop up on Craigslist for a good price and pass up the chance. Until one day I saw a Tornado going for around 400 bucks. I couldn’t pass up the deal. A few weeks after getting it set up in our sunroom, I was going through some boxes and found an old Raspberry Pi that I used to use for playing emulated games. I thought to myself, “what would be a good project to use this on?” and then it dawned on me: I should automate a scoreboard for the foosball table.
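The gist of the scoring setup, as a rough sketch rather than the exact project code (pin numbers and team names are made up): a sensor behind each goal pulls a GPIO pin low when the ball passes, and a callback bumps that side’s score.

```python
# Hedged sketch of goal detection on a Raspberry Pi with RPi.GPIO.
# BCM pin assignments and team names below are hypothetical.
import RPi.GPIO as GPIO
import time

GOAL_PINS = {"black": 17, "yellow": 27}  # hypothetical BCM pin assignments
score = {"black": 0, "yellow": 0}

def goal_scored(channel):
    # channel is the BCM pin that fired; map it back to a team
    team = next(t for t, pin in GOAL_PINS.items() if pin == channel)
    score[team] += 1
    print(f"GOAL! {team} now has {score[team]}")

GPIO.setmode(GPIO.BCM)
for pin in GOAL_PINS.values():
    GPIO.setup(pin, GPIO.IN, pull_up_down=GPIO.PUD_UP)
    # debounce so one ball doesn't register multiple goals
    GPIO.add_event_detect(pin, GPIO.FALLING, callback=goal_scored, bouncetime=500)

try:
    while True:
        time.sleep(1)
finally:
    GPIO.cleanup()
```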
DBTA Data Summit Boston
I’d never heard of the DBTA Data Summit until this year. Apparently this is the 9th one, so you’d think I would have heard of it by now, seeing that I’ve been in the industry longer than it has existed. I usually attend the bigger conventions each year, but this year, due to having booked vacations that I’d much rather be on, I couldn’t attend the usual PASS, Microsoft, or Databricks conferences. I was really excited to see and hear about the developments in the world of Spark in San Francisco, but I wouldn’t be able to make it there this year. After a quick Google search, Data Summit seemed to be my only choice. So off I went to Boston…
The Most Important Tool in Any Data Engineer’s Wheelhouse
There are a lot of productivity tools out there that can accelerate development, help organize things better, or even speed up processes. But there’s one tool that I go to more often than not that very few developers know about…
Querying Delta Lake with T-SQL via Synapse Serverless and Managed Instance
In this blog post I try to demystify how to set up an environment that uses Azure Synapse Serverless, Delta Lake on ADLS Gen2, and SQL Managed Instances so you can query your Delta Lake with T-SQL as if it were any other SQL source, accomplishing something like PolyBase.
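The core query pattern is that Synapse Serverless can read a Delta folder in ADLS Gen2 directly with OPENROWSET(... FORMAT = 'DELTA'), and anything that speaks T-SQL can consume it. The full post covers wiring this up from a Managed Instance; the pyodbc sketch below (placeholder endpoint, credentials, and storage path, not my actual environment) just shows the query shape:

```python
# Hedged sketch: run an OPENROWSET ... FORMAT = 'DELTA' query against a
# Synapse Serverless SQL endpoint. Server, database, credentials, and the
# ADLS path are all placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=your-workspace-ondemand.sql.azuresynapse.net;"  # serverless endpoint
    "DATABASE=lakehouse;UID=sqladmin;PWD=<password>;Encrypt=yes;"
)

query = """
SELECT TOP (10) *
FROM OPENROWSET(
    BULK 'https://youraccount.dfs.core.windows.net/lake/silver/orders/',
    FORMAT = 'DELTA'
) AS orders;
"""

for row in conn.cursor().execute(query):
    print(row)
```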