I’d never heard of the DBTA Data Summit until this year. Apparently this was the 9th one, so you’d think I would have heard of it by now, seeing that I’ve been in the industry for longer than it has existed. I usually attend the bigger conventions each year, but this year, having booked vacations that I’d much rather be on, I couldn’t attend the usual PASS, Microsoft, or Databricks conferences. I was really excited to see and hear about the developments in the world of Spark in San Francisco, but I wouldn’t be able to make it there this year. After a quick Google search, Data Summit seemed to be my only choice. So off I went to Boston…
I signed up for the pre-conference workshops as well as all the other offerings. Upon arrival on day one, I realized that this would be a much smaller crowd than what I’m used to. On day one there may have been 100 people, which really had me questioning whether I’d made a good choice, but by day two there were close to 1,000, maybe a little less.
I got my badge, which reassured me that I had made the right choice in getting the all-access pass. It literally told me it was the best deal right on it.
Day 1
The first day was a choice between a couple of different three-hour workshops. You were locked into your choices and could not switch between them if you weren’t getting any value from them. Sadly, that was the case for the first one I attended. I signed up for the Essentials of Data Privacy & Security and was hoping I’d get some practical examples of how to implement these things, or even some cautionary tales of what to avoid. It turned out to be more of an overview of what legislation affects data.
Afterwards, I switched tracks and attended Build Actionable Road Maps for Enterprise Data & Analytics. This one turned out to be very informative and gave great case studies on data maturity and what to look for when establishing a maturity level for your organization. It was presented by Wayne Eckerson, who could be an Anthony Bourdain lookalike. He went into detail on various clients his group has worked with and outlined the roadmaps that were developed during their engagements. I definitely left this session with some great ideas to take back to my current organization and share with our leadership when it comes to what’s next on our roadmap for data.
Day 2
The second day I chose to focus on DevOps and see what developments there have been in the industry around the recently coined term “DataOps”. I’ve always called this “DevOps for Data”, but someone decided we needed another buzzword.
The first session in this track was exactly what I wanted. I was looking for a session I could gather ideas from for an upcoming talk I was going to give internally at a company conference. Chris Bergh did a great job of showing how to sell the idea of DevOps or DataOps based upon the benefits of its implementation, arguing that any type of testing is better than no testing, even if that’s just running tests in production against the performance of your pipelines over time, or data-profiling-based tests on sets of data that shouldn’t change all that much.
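To make that last idea concrete, here’s a minimal sketch of what a data-profiling-based test could look like, assuming the data lands in a pandas DataFrame. The table shape, the customer_id column, the baseline numbers, and the tolerance are all hypothetical, just to illustrate the pattern of comparing a current profile against a known-good one.

```python
import pandas as pd

# Hypothetical baseline profile captured from a previous "known good" load.
BASELINE = {"row_count": 1_250_000, "distinct_customers": 48_200}
TOLERANCE = 0.02  # allow 2% drift on a dataset that should barely change


def profile(df: pd.DataFrame) -> dict:
    """Compute a tiny profile of the dataset we expect to be stable."""
    return {
        "row_count": len(df),
        "distinct_customers": df["customer_id"].nunique(),
    }


def check_reference_data_is_stable(df: pd.DataFrame) -> None:
    """Fail loudly if the current profile drifts beyond the allowed tolerance."""
    current = profile(df)
    for metric, expected in BASELINE.items():
        drift = abs(current[metric] - expected) / expected
        assert drift <= TOLERANCE, (
            f"{metric} drifted {drift:.1%}, which exceeds the {TOLERANCE:.0%} limit"
        )
```

Nothing fancy, but even a check like this running on a schedule in production is more than many teams have today.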
Later in the day there was another presentation that almost went against the first of the day. Instead of justifying DataOps as a time investment worth making and focusing on driving value through automated testing, this speaker implied doing the opposite: use DataOps with a fail-fast approach, focusing not on testing but on speed to market through a CI/CD process and the ability to make corrections rapidly.
I tend to be in the first speaker’s camp. If you’re constantly pushing bad changes to your customers, you’re going to end up spending the majority of your time fixing those issues and losing customer trust. I don’t believe fail fast applies to data projects, because the risk of losing that trust is so high. It could be argued that you should fail fast in development, but not in production!
The third presentation I attended was a sales pitch, but it did bring to light things we should all be looking at: monitoring the logs of all our data processes to get better insight into costs and performance, and looking for things that are out of the ordinary through linear regression. For example, are the counts of my ETL processes trending the same from input record count to output record count? Are certain stages in my Spark jobs starting to take longer than normal? And how do we quantify where we are spending the most money throughout the entire lineage of our data?
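As a rough illustration of the regression idea (the counts and the alert threshold below are made up), you could fit a line to an ETL job’s daily output-to-input ratio and alert when the slope suggests the job is silently dropping a growing share of records:

```python
import numpy as np

# Hypothetical daily (input, output) record counts pulled from an ETL job's logs.
input_counts = np.array([100_000, 101_500, 99_800, 102_300, 100_900, 103_100])
output_counts = np.array([99_400, 100_800, 99_100, 100_200, 97_800, 96_500])

days = np.arange(len(input_counts))
ratios = output_counts / input_counts  # should hover around a constant value

# Fit ratio = slope * day + intercept; a persistent negative slope means the
# share of records making it through the pipeline is shrinking over time.
slope, intercept = np.polyfit(days, ratios, 1)

SLOPE_ALERT_THRESHOLD = -0.005  # hypothetical: losing >0.5% more per day
if slope < SLOPE_ALERT_THRESHOLD:
    print(f"ALERT: output/input ratio trending down (slope={slope:.4f}/day)")
else:
    print(f"OK: ratio looks stable (slope={slope:.4f}/day)")
```

The same trick works for stage durations in Spark jobs: fit a line over the last few weeks and flag stages whose runtimes are trending up.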
Day 3
The last day was like any other conference: by the end of the day almost everyone had headed home and very few remained for the final sessions and closing keynote. I didn’t have the luxury of a flight back to Denver as soon as the conference was over, so I stuck around for the whole thing and chose to fly out the following morning.
This was a good choice, because the last couple of sessions I attended helped solidify where we might focus our work in the near future: building more intelligent monitoring that reports not only on pipelines but also on data quality in terms of completeness, volumetrics, and timeliness.
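For what it’s worth, this is roughly the kind of check I have in mind. It’s a sketch only: the customer_id and loaded_at columns, the expected row count, and the assumption that loaded_at holds timezone-aware timestamps in a pandas DataFrame are all mine, not anything shown at the conference.

```python
from datetime import datetime, timezone

import pandas as pd


def data_quality_report(df: pd.DataFrame, expected_rows: int) -> dict:
    """Tiny completeness / volumetrics / timeliness summary for one load."""
    now = datetime.now(timezone.utc)
    return {
        # Completeness: how much of a required column is actually populated.
        "pct_customer_id_populated": df["customer_id"].notna().mean(),
        # Volumetrics: did we receive roughly the volume we expected?
        "row_count": len(df),
        "pct_of_expected_volume": len(df) / expected_rows,
        # Timeliness: how stale is the newest record in this load?
        # (assumes loaded_at is a timezone-aware timestamp column)
        "hours_since_latest_record": (now - df["loaded_at"].max()).total_seconds() / 3600,
    }
```

Publish a report like this for every load and the trending and alerting pieces described above become straightforward to bolt on.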
The final keynote of the event was on DaaS, or Data Architecture as a Service. This was an interesting discussion on where the industry might be going in terms of self-service data pipelines. If we are going to have citizen development at the pipeline level, is there a way to provide those developers with a platform of templates that abides by our data architecture best practices and does not create siloed data and processes?
We were shown a demo of a tool that had just launched and attempted to do this, but I could also see ways to implement this within an organization without special tooling. We’ll see if that proves possible eventually.
All in all, it was a decent conference. I didn’t get as detailed information as I would have at other conferences, but I don’t think that’s the goal of this one. I guess when you consider who put the event on (Database Trends and Applications), it makes sense that the conference would be heavily focused on trends and not so much on specifics.
I’d give it 7 out of 10 stars.