Data, Data, and Metadata: Letting ChatGPT Interpret Power BI Output

Rob Collie

Founder and CEO

Justin Mannhardt

Chief Customer Officer

What happens when you hand off your Power BI output to ChatGPT and ask it to make sense of your world? You might be surprised. This week, Rob shares a deeply personal use case. One that ties together two major themes we’ve been exploring:

  1. Gen AI is reshaping the way we think about dashboards.
  2. To get real value out of AI, you need more than just data. You need metadata.

And yes, that kind of metadata—the kind you create in Power BI when you translate raw data into something meaningful.

Along the way, we revisit the old guard of data warehousing. The mighty (and now dusty?) ETL priesthood. And we uncover a delicious little irony about how the future of data looks a lot like its past, just with better tools and smarter questions.

The big twist? We’re all ETL now. But the “T” might not mean what you think it does anymore.

Listen now to find out how a few rows of carefully modeled data, a table visual, and one really good AI assistant changed the game. For Rob and, just possibly, for all of us.

Also in this episode:
Blind Melon – Change (YouTube)

The Data Warehouse Toolkit

Raw Data Episode – The Human Side of Data: Using Analytics for Personal Well-Being

Episode Transcript

Rob Collie: Hello, friends. There are two themes we've been exploring in recent episodes that are increasingly converging in my mind. One is how AI is going to potentially upend our traditional understanding of dashboards and the ways in which we use them. The other theme is how in order to get the most out of AI, you need to feed it data, data and metadata.

Well, over the weekend I had a personal experience that really gelled those themes for me in like a super visceral way. I want to share that experience with you. [00:00:30] But first I want to acknowledge a parallel from the past that I find particularly delicious and ironic. When I first realized back in 2010 how revolutionary Power BI was going to be, it almost immediately put me on a collision course with the data warehousing priesthood. I knew this priesthood existed, but I didn't know just how deep it went. I didn't know it had a bible, The Data Warehouse Toolkit, written by Ralph Kimball, whose name was always spoken in hushed reverence. [00:01:00] I didn't know that The Data Warehousing Bible weighed in at 608 pages and 2.2 pounds, qualifying it as a deadly weapon in close quarters. And I also didn't know that there were people who would immediately dismiss you if you hadn't dedicated a decade or more of your life to said Bible.

Now, in fairness, this was more than just a cult of opinion. It was a profession, a career. It paid well and work was plentiful, so they weren't wrong to invest their time, energy, [00:01:30] and faith in it. But the universe did have a dirty little twist in the works: it was about to remind the priesthood of something the priesthood had long forgotten, if indeed it had ever known it in the first place. And the reminder was this: your area of expertise, data warehousing, is just a means to an end. Actually, I'll restate that and remove the word just. It was a means to a valuable end, and was itself therefore valuable. But the twist was that the data [00:02:00] warehousing priesthood had long lulled itself into the delusion that the data warehouse was itself a valuable output.

Now, this is simultaneously a ridiculous thing to believe while also being an easy thing to fall into. If you're paid big money to do X for many years, and that never changes, it really is just human nature to sort of shortcut one's own thinking and say X is valuable in its own right. But data warehousing is almost [00:02:30] by definition just a supporting role. Even the name itself, warehousing, betrays this truth. Is a physical, real-world warehouse valuable on its own, or is it valuable just for the materials and/or products that are stored therein? And those products and materials, are they valuable if they just sit there forever, or are they only valuable when they're taken out and sent somewhere where they're actually used? Same thing with data warehousing.

The first role of data warehousing [00:03:00] was really just "don't lose any of the data we've been generating." The line-of-business applications that run our businesses are generating data all the time, but especially in the old days those line-of-business systems had a habit of throwing away data after, say, a year or so. You don't need this anymore for operational purposes, so we'll just get rid of it, save you the storage. So you had to grab it and snapshot it elsewhere before it went away. And of course, there were other problems too, and there still are. Line-of-business systems very often don't care about the history; [00:03:30] most of the time, they just store current balances as of today's date, for instance. So you don't even have a year to wait sometimes. Sometimes if you don't grab the data today, it's going to be overwritten tomorrow. And then of course, querying all of your line-of-business systems in real time sucks and/or is impossible.

So if you ever want to make use of your data in the future for any analytical needs, yeah, you better have a data warehouse. Again though, the data stored in the data warehouse is a means to an end. It [00:04:00] is meant to power other things. Analysis, which leads to insights. And even insights are means to an end because the ultimate goal, the only one which means anything, is the improvement of the business's performance. Now, if you could just jump straight to improving business performance without all of that, you would, but you can't.

Okay, so back to the priesthood and the acronym ETL: extract, transform, and load. Extract from the line-of-business system, transform [00:04:30] into a more informative and unified format, and load into the data warehouse. The design of the data warehouse itself of course was very important and was the subject of the aforementioned two-plus-pound bible, but you'd often see the terms ETL and data warehousing used almost interchangeably. So here's the lesson we learned then, and we're going to keep learning it over time: when the thing you do for a living is a means to an end and is not the end itself, you'd better remember that and be ready to change, [00:05:00] because the data warehousing priesthood got blindsided a bit by two big converging trends.

One was the startling revelation, thanks to Power BI, that a very simple star schema dimensional model was all you needed to power most business needs, with the added twist that the simple star schema you needed was often highly tailored to the specific need, the particular circumstances of analysis. So this meant that, A, you didn't need to have an elaborate data warehouse in advance, but could instead quickly build your own star schema cached [00:05:30] within Power BI's VertiPaq semantic model. And B, it meant that even if you did have an elaborate data warehouse, you still usually needed to perform some additional ETL work to optimize the schema for analysis. Not to mention, of course, the need to splice in other data sources which were not yet captured in the data warehouse.

To put a bow on all of that, data warehouses were often unnecessary and often incomplete. At the same time, the storage world went through a massive revolution, starting with Hadoop and culminating today in things we call [00:06:00] data lakes, which made the task of "don't lose any data" much, much simpler than it had ever been before. You no longer needed a dimensional model to capture all of that data. You could just capture it all in a semi-raw, ugly, file-based format, and you could do so quickly and cheaply. And then later, when the time came to do some analysis, you could decide what your simple star schema needed to look like and relatively quickly generate it with a modern set of ETL tools like Power Query.

Data warehousing [00:06:30] didn't end, but it sure changed dramatically. The number of overall projects did rise, but the stylistic mix of projects changed by a lot. Most projects ended up being data lake projects first, rather than SQL-based like data warehouses had always been. And dimensionality became something you did later, if at all. So today you can still buy Kimball's bible, but it was last updated in 2013. ETL used to be a career choice. [00:07:00] Now we still do all of those things, E, T, and L, but we do them so differently and at such different points in the workflow of a project that we don't even really talk very much anymore about ETL as a thing. The ETL priesthood, who vehemently opposed my faucets-first methodology, didn't exactly die; it just did kind of a dissolve fade into the next phase of our industry's history.

So imagine my amusement Saturday night when I sat back from my desk [00:07:30] with a huge smile on my face and said, "Oh my God, we're all ETL now." Here's what I was doing that led me to that amusing observation. I've been working again on the Power BI semantic model that we use to track my wife's medical situation. If you want to hear more about this model, we released an entire episode about it in June of last year, 2024. Well, Saturday night I found myself staring at a conditionally formatted table visual trying to spot correlations, correlations between about a half dozen different [00:08:00] symptoms and four or five different inputs, variables that we control.

Scanning for correlations with one's eyeballs is unreliable even when you do have good conditional formatting, unless of course the correlations are fairly obvious. So the right move in these situations, when you don't have obvious correlations, is to write a linear regression measure comparing each input to each symptom. This is a lot of tedious work. Writing those formulas is not fun, and writing 60 of them is even [00:08:30] less fun if you want to compare every possible pair of symptoms and inputs. And calculation groups, while being better, well, those still aren't fun either, which is why I haven't done a good job to date on correlation analysis.
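To make the tedium concrete, here is a minimal sketch of the kind of pairwise measure being described, written as a Pearson correlation coefficient in DAX rather than a full regression. All of the table and column names (Daily, InputX, SymptomY) are hypothetical stand-ins, not names from Rob's actual model:

```
-- One of the ~60 pairwise measures described above (all names hypothetical).
Correlation InputX vs SymptomY :=
VAR PairRows =
    FILTER (
        Daily,
        NOT ( ISBLANK ( Daily[InputX] ) ) && NOT ( ISBLANK ( Daily[SymptomY] ) )
    )
VAR n   = COUNTROWS ( PairRows )
VAR Sx  = SUMX ( PairRows, Daily[InputX] )
VAR Sy  = SUMX ( PairRows, Daily[SymptomY] )
VAR Sxy = SUMX ( PairRows, Daily[InputX] * Daily[SymptomY] )
VAR Sxx = SUMX ( PairRows, Daily[InputX] * Daily[InputX] )
VAR Syy = SUMX ( PairRows, Daily[SymptomY] * Daily[SymptomY] )
RETURN
    DIVIDE (
        n * Sxy - Sx * Sy,
        SQRT ( ( n * Sxx - Sx * Sx ) * ( n * Syy - Sy * Sy ) )
    )
```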

But when she has a symptom flare-up like she's been having for the past week or so, and this flare-up comes just completely out of the blue, with the primary symptom of the flare-up being insomnia, which kind of wrecks her life, I find myself highly motivated to try new approaches. Necessity is the mother of self-reinvention. [00:09:00] As an aside, if you're unfamiliar with the Blind Melon song "Change," I highly recommend looking it up. It's one of my favorites. We lost Shannon Hoon far too soon. End of aside.

And then it hit me. I'll just export the table visual's contents to CSV and then feed the CSV to ChatGPT and ask it to analyze all the possible correlations. And boy, it did it. It was effortless. So easy and light to just [00:09:30] paste some CSV and ask a conversational question. And it returned me some very interesting results, namely that there are no correlations between any of her symptoms and any of the inputs we've been carefully managing for the past several years.

Now, you might think that having one's working hypothesis so thoroughly invalidated, as I did Saturday night, might sound discouraging. But instead it allowed us to seek other hypotheses and approach the problem very differently. And the [00:10:00] very next day we started doing something new that we'd never tried before, and it helped. Even having your theories rejected is valuable information. So there's a happy ending there. We've talked a lot about this before, by the way: lowering the cost of performing an analysis doesn't just make the analysis quicker, it also makes it far more likely that you even do it in the first place. Thank you, ChatGPT.

But why did I laugh and conclude that we're all ETL now, [00:10:30] and why was I happy about that? Well, it comes back to data, data, and metadata, but when you hear me say those words, I want you to bold the word metadata in your head. On the surface, the data model I've created for this purpose isn't very impressive. The entire PBIX file is like 700 kilobytes on disk. It contains at most a thousand-ish rows of fact data, and all of that data is hand-entered. It's collected in several different Excel Online workbooks that I edit on different wavelengths [00:11:00] like daily, weekly, monthly.

But even in this seemingly primitive situation, there is no way that ChatGPT could have been fed the raw data and come to any of the conclusions it gave me on Saturday night. You see, by the time I export the data from that table visual in my report, the numbers coming out are meaningfully different from what went in. For instance, one of the inputs we track is: what's her bloodstream level of medication X each night? Well, that has to be calculated. Every [00:11:30] single dose she takes is timestamped in the hand-entered data, because there's an exponential decay curve according to the half-life of the medication. So her bloodstream level of the medication at a given point in time is essentially a lifetime-to-date SUMX, with the interior of the SUMX calculation being a POWER function of the hours since the dose, multiplied by the size of the dose.
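A minimal DAX sketch of that lifetime-to-date decay sum, assuming a hypothetical Doses table with a timestamp and an amount column, and an invented half-life constant used purely for illustration:

```
-- Exponential-decay running sum over every dose taken so far (all names hypothetical).
Bloodstream Level X :=
VAR HalfLifeHours = 6          -- assumed constant, for illustration only
VAR EvalTime = NOW ()
RETURN
    SUMX (
        FILTER ( ALL ( Doses ), Doses[DoseTimestamp] <= EvalTime ),
        VAR HoursSince = DATEDIFF ( Doses[DoseTimestamp], EvalTime, HOUR )
        RETURN
            Doses[DoseAmount] * POWER ( 0.5, HoursSince / HalfLifeHours )
    )
```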

In order to have a valuable metric that actually correlates to anything real that has meaning in our life, [00:12:00] we need to perform that relatively sophisticated calculation. And there's another input dealing with a different medication; we'll call it medication Y. But in this case it's not the bloodstream level that matters, but how much we've been reducing the dose recently. So there's a weighted average of all the reductions we've made in the past 14 days, with each reduction weighted by how long ago it was made, so that more recent reductions count for more.
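And a sketch of that recency-weighted average, again in DAX with hypothetical names (a Reductions table and a Calendar date table), with a simple linear weighting assumed since the exact weighting scheme isn't spelled out in the episode:

```
-- Weighted average of dose reductions over the trailing 14 days,
-- with more recent reductions weighted more heavily (all names hypothetical).
Recent Reduction Y :=
VAR EvalDate = MAX ( 'Calendar'[Date] )
VAR Window =
    FILTER (
        ALL ( Reductions ),
        Reductions[ReductionDate] > EvalDate - 14
            && Reductions[ReductionDate] <= EvalDate
    )
RETURN
    DIVIDE (
        SUMX (
            Window,
            VAR DaysAgo = INT ( EvalDate - Reductions[ReductionDate] )
            RETURN Reductions[ReductionAmount] * ( 14 - DaysAgo ) / 14
        ),
        SUMX (
            Window,
            VAR DaysAgo = INT ( EvalDate - Reductions[ReductionDate] )
            RETURN ( 14 - DaysAgo ) / 14
        )
    )
```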

As a third and final example, the model uses inactive [00:12:30] relationships and the USERELATIONSHIP function so that certain symptoms and inputs line up properly. It's a long story, but the way that's most convenient to enter the data is one that ends up capturing some of today and some of yesterday in the same row of the spreadsheet. So when it's time to analyze properly, I need to time-shift some of those metrics by a day. Otherwise, they won't line up with the input values, which may or may not influence the symptoms.
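That day-shift pattern looks roughly like this in DAX, assuming a hypothetical inactive relationship between an Entries[PriorDate] column and the Calendar date, plus a hypothetical base measure:

```
-- Re-align a metric by one day via an inactive relationship (all names hypothetical).
Symptom Score (Shifted) :=
CALCULATE (
    [Symptom Score],
    USERELATIONSHIP ( Entries[PriorDate], 'Calendar'[Date] )
)
```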

Okay, so just like I told you to bold the word metadata [00:13:00] in data, data, and metadata, let's go back to ETL and bold the T. Those three things I described above, the exponential running sum, the weighted average of reductions, and the day-shifting semantics, are all examples of transformation. I'm transforming raw data into the things that actually matter. If you think for a moment of formulas and relationships as metadata that's enhancing the raw data, that's unambiguously imbuing that raw data with meaning [00:13:30] in the human plane, then you're on the right wavelength. And this is pretty different from the traditional transformations that were part of traditional ETL.

I mean, sure, there were definitely times in ETL, traditional ETL, where we did imbue some meaning into the data using formulas, but that was more the exception than the rule. Mostly what we did in the transform step back in those days was reshape. Whereas what we do in a Power BI semantic model is mostly imbuing meaning: we're collecting data from multiple sources [00:14:00] in one place and enhancing it with very, very important metadata. If you're a Power BI professional, your job is primarily to perform that enhancement, to imbue the data with meaning in an unambiguous manner that Power BI can interpret. That Microsoft returned to calling these things semantic models is actually kind of the rare example of them getting it right naming-wise. Well, except for being a little bit nerdy, but we can't expect too much, right?

So this experience tied together a lot of threads [00:14:30] for me. Staring at dashboards looking for meaning is often incredibly useful, but in other ways it's often incredibly primitive. My eyeballs can't even process a 40-row table visual and conclude whether there's a pattern, and that's a situation where there aren't even any slicers. If there were slicers involved in the dashboard, a 40-row table visual quickly becomes hundreds if not thousands of different slices. And I'll never look at all those combinations. Wouldn't it be more effective to just have a gen AI [00:15:00] agent periodically scanning the output of my data model at that level, looking for things that I might need to know?

So the way we consume our data models very well might change. But for the foreseeable future, I won't be able to just feed raw data into the AI and ask it to just go figure it out; I'll still need to convey the semantic rules somehow. Instead of an AI, imagine you had a person who was amazing at Power BI, was amazing at statistics, whose time cost you nothing, and [00:15:30] who responded as quickly as ChatGPT. Just imagine it's a person. You'd still have to explain things to that person too, right? Meaning. Power BI professionals translate meaning into unambiguous rules and merge those rules into the data.

Yeah, we're the meaning transformation. And even if and when some reasonable percentage of dashboard usage gets replaced by AI scanners, and even as AI gets better and better at helping us write [00:16:00] the rules that capture that semantic meaning, the role of meaning translator remains pretty durable. Yeah, we're all ETL now, but when the T becomes more translate than reshape, I think the future is bright for those who are willing to adapt.

Check out other popular episodes

Get in touch with a P3 team member

Subscribe on your favorite platform.