Making Sense of Spark When You “Only” Have a Power BI Background, w/ Microsoft’s Chris Finlan

Rob Collie

Founder and CEO

Justin Mannhardt

Chief Customer Officer

In this week’s episode of The Raw Data Podcast, we’re excited to reconnect with Chris Finlan, a pivotal figure at Microsoft in the evolution of SQL Server Reporting Services and now a leading advocate for Power BI. Chris invites us on a journey from his early career breakthroughs to his current pioneering work in data analytics.

Joined by host Rob Collie, Chris revisits an extraordinary story: a data model he crafted over a decade ago that still powers significant operations at a major organization. This isn’t merely a conversation about the durability of technology; it underscores the lasting impact of thoughtfully designed data solutions that continue to perform year after year.

If you’re curious about how innovations in technology continue to shape business today, or if you love hearing the personal stories behind the tech, this episode is definitely for you. Tune in for a perfect blend of nostalgia and insightful reflections, delivered in Chris’s unique style from his experiences at Microsoft. Be sure to subscribe to The Raw Data Podcast on your favorite platform for more down-to-earth conversations about data, tech, and biz impact, and join us in exploring the fascinating intersection of technology and practical business solutions.

Also in this episode:

The Persistent Power of Pagination, w/Microsoft’s Chris Finlan

Episode Transcript

Rob Collie (00:00:00):
Hello, friends.

Today, we welcome my good friend, Chris Finlan of Microsoft. Actually, we welcome him back because this is his second appearance. In fact, he was the fourth ever guest on the show on October 20th, 2020, long time ago.

I checked the date on his last appearance because during our conversation this time we were talking about his "new" role at Microsoft, but he's been in this role a lot longer than I'd realized. Since 2021 in fact.

Last time he was here, he was working on paginated reporting and its integration into the Power BI product. As lift and shift operations go, integrating something like paginated reporting into something like Power BI, with the subscription and the whole cloud service thing, for the first time ever required a lot of ingenuity, a lot of customer interaction, and also a lot of tenacity.

So when Microsoft faced a similar but probably even harder problem, which is how to integrate the Spark compute engine into the Power BI and Fabric model of operation, Chris was a natural choice to get involved.

So that's what he's responsible for now. He manages the team of product managers that are uniting these two very distinct, historically anyway, distinct worlds of the Spark and data science workload on one side and the Power BI and the analytics and the reporting side that most people listening to this podcast are much more familiar with, I suspect. And of course, Fabric is all about unifying these two worlds, bringing them slowly, closer and closer together and integrating and unifying them.

Speaking of familiarity with reporting and analytics, Chris and I have similar backgrounds. We both grew up through that space, which again, I think is the majority of people listening to this have similar journeys and are less familiar with things like Spark. So things like Power BI and paginated reporting are very familiar to Chris, very familiar to me.

But when Chris took the job a few years ago, he actually knew nothing about Spark. Of course, that's all changed today. He knows quite a bit about Spark now.

And that struck me as a tremendous opportunity, a chance to get my head better wrapped around this thing called Spark, what it's really about and when and why you would use it. And I definitely left this conversation having at least succeeded in large part with that. I now have a much better grasp of how Spark fits into the picture, even though I'm still unlikely to be using it anytime soon. That knowledge is really going to help me.

So if you're in a similar spot coming from that reporting and analytics background like me, I hope you find this valuable.

But that's not the only thing we discussed. We talked about making the switch from individual contributor to manager, how operational cost is now an intrinsic part of software design and engineering.

We talked about the selection of hardware for data centers. Moved on to Schwarzenegger movies, also a killer project he and I collaborated on in 2014, cheating at Scrabble GO, the hidden gem in Excel known as cube formulas, and we pretty much laughed nonstop.

Folks, it just felt great and recording this was absolutely the highlight of last week for me.

So in hindsight, it was far too long in between his appearances on the show and we're going to have to have him back on a bit more often. So join me now as we reset the weeks since Chris Finlan was on the show counter from 188 to zero.

Speaker 2 (00:03:24):
Ladies and gentlemen, may I have your attention, please.

Speaker 3 (00:03:27):
This is the Raw Data by P3 Adaptive podcast with your host, Rob Collie, and your co-host Justin Mannhardt. Find out what the experts at P3 Adaptive can do for your business. Just go to Raw Data by P3 Adaptive is data with the human element.

Rob Collie (00:03:53):
Welcome back, Chris Finlan.

Chris Finlan (00:03:54):
When was I on before?

Rob Collie (00:03:56):
I don't know. 10 years ago? The podcast is only four years old, but I think it was 10 years ago.

Chris Finlan (00:04:02):
Ten years ago, you and I were out in L.A.

Rob Collie (00:04:04):
Oh my God. It was roughly 10 years ago. One of the monster projects, which is still running to this day.

Chris Finlan (00:04:10):
Is it really? It's amazing.

Rob Collie (00:04:12):
That model is running. When you say the project is still running, it makes it sound like a traditional project, the kind that sucks resources forever. No, not that sense. That SSAS tabular data model is still operational.

Chris Finlan (00:04:23):
Is it really? That's unbelievable.

Rob Collie (00:04:24):
It's still powering, as far as I know, that entire floor of analysts in this large organization that we're talking about. We did do a tune up. Our favorite person at that client, she was threatening retirement 10 years ago, still not retired.

Chris Finlan (00:04:40):
Really? Oh, wow. Good for her.

Rob Collie (00:04:41):
And came back to us and said, "Hey, the model isn't working anymore. We need to fix it and then while we're at it, let's see if we can speed it up." Turns out, yes, it could be sped up because you think 2014 Rob was writing optimal DAX? I don't think so. Yeah. Every one of those measures starts with one of the ugliest cascading ifs.

Chris Finlan (00:05:02):
Oh, really?

Rob Collie (00:05:02):
Of all time. If has one filter of this, if has one filter of that.

Justin Mannhardt (00:05:07):
This was us. We came back in over top of Rob's work. I can imagine just a consultant being like, "Who the hell wrote this?"

Rob Collie (00:05:15):
Measures that were originally named with the word magic in the name. They were in the model with the word magic.

I came back one time a year later and they're like, "Can you get the word magic out of these measurements?" I'm like, "Fine." But it's a disservice. Stuff was magic.

That's still one of the coolest models and I couldn't have done it without you, by the way.

Chris Finlan (00:05:36):
Well, yeah, I know that.

Rob Collie (00:05:37):
My Power Pivot skills are struggling with this SSAS tabular environment. Thank God you were there. That was a collab.

Chris Finlan (00:05:49):
It was a fun project.

Rob Collie (00:05:50):
And those poor people still trapped.

That's one of the things I meant to... Actually, that's a joke I meant to make and I've already ruined it. It was like when we sent our consultant back out to fix my work, you found those people were still there in that parking garage trapped between those two gates.

Chris Finlan (00:06:06):
One of the funniest faces you've ever made, the evilest laugh you've ever had at breakfast, telling me the story of these poor people who were stuck between the guard gates.

Rob Collie (00:06:17):
Two parking gates. One parking gate went up to let them out, but then there's one car length of space before the next parking gate and the parking gate goes down behind them and the one in front won't go up and they're stuck in this Austin Powers situation and they're asking me for help as I walk by. I'm like, "I can't help you." As far as we know, those people just starved to death there or were eaten by wolves.

Chris Finlan (00:06:42):
He's telling me this story. He's like, "Yeah, they're really screwed."

Rob Collie (00:06:48):
"Can you help us?" No.

Chris Finlan (00:06:50):
Couldn't pass any number of security people in the way. No.

Justin Mannhardt (00:06:53):
Chris, these people are going to die out there. Can you pass the butter?

Chris Finlan (00:06:57):
The lack of empathy was remarkable.

Rob Collie (00:07:00):
In the foothills of L.A. with coyotes and... Anyway.

So I wanted to... As a bit of perspective, I think this is really awesome. So when you and I met many years ago, Power Pivot was the only thing. All this stuff we're talking about, Power BI, the Power Platform, Power Apps, Fabric, all of this stuff, none of that stuff existed. There was only Power Pivot.

We didn't have even Power Query yet. When you and I were working in SSAS tabular at this one enterprise customer, there wasn't any version of Power Query accessible to SSAS Tabular.

Chris Finlan (00:07:32):
That's correct.

Rob Collie (00:07:32):
It had arrived in Excel, but nowhere else.

Chris Finlan (00:07:35):
These were the days of the never talked about Power BI for Office 365, which has just been scrubbed from the memory of everyone.

Rob Collie (00:07:43):
That's right.

Justin Mannhardt (00:07:45):
See that dot on the timeline? Let's just get rid of that dot.

Rob Collie (00:07:48):
That's the Highlander II of Sean Connery's career.

Chris Finlan (00:07:53):
Wow. That's rough.

Justin Mannhardt (00:07:56):
I didn't know we were having a roast, but...

Rob Collie (00:07:59):
You start off in this Power Pivot world of seeing that things are going to change. You and I glimpse something similar and that's why our paths crossed. Next thing you know, you're at Microsoft Redmond, you're in corporate working on the product teams. When we met, you were in the field.

Chris Finlan (00:08:15):
That's right.

Rob Collie (00:08:15):
Next thing you know, you're the paginated reporting person.

Chris Finlan (00:08:19):
That's right.

Rob Collie (00:08:19):
You even have a bear mascot.

Chris Finlan (00:08:22):
Actually, I don't know where the bear is. He's around. His name is actually Mickey, to be fair.

Rob Collie (00:08:26):
Really? There's not even any alliteration. Paginated reporting bear should be named like Pauly. Pauly, the paginated reporting bear or something.

Chris Finlan (00:08:33):
Okay. So first of all, it's paginated report bear. There is no reporting bear. Let's be factually correct here.

Rob Collie (00:08:39):
Man, it just shows you how clumsy I am. I'm sorry.

Chris Finlan (00:08:41):
Feel you're actually that sorry.

Rob Collie (00:08:45):
In the data analytics reporting game, Power Pivot and paginated reporting, and they're not like enemies, but they're opposite ends of a spectrum.

Chris Finlan (00:08:55):

Rob Collie (00:08:55):
And now you're in a completely new place.

Chris Finlan (00:08:58):
I am.

Rob Collie (00:09:00):
And this is a place that I personally have very little direct experience with. Justin does. I know it's a grand simplification to call this your third act. Tell us what you're up to today.

Chris Finlan (00:09:09):
Sure. This will be my 11th year at Microsoft. I'm coming up on anniversary next month.

Justin Mannhardt (00:09:14):

Rob Collie (00:09:15):

Chris Finlan (00:09:15):
And I moved out to Redmond in 2015. First worked on the Datazen product, which was integrated into SSRS and called Mobile Reports and then I eventually took over that team as the PM lead and then I was on that team until late 2021.

One of the things that I did as part of that team was I was the first workload to integrate into what was Power BI Premium. That was not the data set.

As you can imagine, for a platform that was designed for the Power BI data set to integrate effectively SSRS into that platform was a big challenge. Just the back and forth and the things you needed to know with the platform team in terms of how does billing work, how does deployments work, what does it mean to have trains that go and deploy features.

All of those pieces that we need to go think about in terms of just integrating into the platform, I had a unique experience of being a PM who had gone and done that, which you wouldn't think of if you think about the paginated report guy and all the features that are associated with SSRS, but it was specifically those kind of somewhat arguably mundane details which are actually super important. Guess what? Billing is pretty important. Funny how that works.

Rob Collie (00:10:33):

Chris Finlan (00:10:34):
At the time I was looking for a change and it just so happened that Justyna Lucznik, who is my manager... Actually, we were on the same team, we reported to the same manager at the time. She moved over to what is now Microsoft Fabric. She was the lead of the Synapse Spark team.

As you can imagine, the integration of Power BI and Synapse was... It's a lot of change and churn bringing those two teams together. But one of the things is, guess what? They wanted to go and build what is now Fabric by integrating all of these things in Synapse into what was the Power BI Premium platform.

Well, guess what? They happened to know a PM who had done exactly that.

Justin Mannhardt (00:11:17):
We've been down this road.

Chris Finlan (00:11:19):
But I didn't know the first thing about Spark, but I spent a ton of time figuring out how to go integrate a workload into the platform.

Rob Collie (00:11:26):
That's fascinating.

Chris Finlan (00:11:28):
And so Justyna was... When we first had the call, Kim Manis had us have the call because she knew I was interested in doing something. Neither one of us expected it would be a good fit for me. By the end of the call we're like, "Well, this is a perfect fit."

It's an opportunity for me to go do something different to leverage the experience I had, just dealing with business planning and things like that. These very basic things that are actually quite challenging at Microsoft once you get into the details of how you go and put together a price and make sure it's billed properly and all of those things, and what does it mean to deploy a Spark cluster and have it translate to CUs now on the platform. So by the end of that call, it was like, "Look, I want to bring you over."

And one of the things is I was in this interesting situation where I started this journey of being a people manager. It was something I had been told for quite some time that people thought I'd be good at it, but I was in a situation where I was a people manager. I was managing two people.

And I recently gave a talk about transitioning from being an IC, individual contributor, to a people manager. And I can tell you that they're completely different in terms of the skillsets. It's not even close. The things you have to worry about from one versus the other are night and day.

But the challenge was is that I was effectively having to still act as an IC while I was a manager and I didn't feel like I was doing either very well. So as part of this transition I was like, "Look, I want to take on a larger team and be able to actually make that transition to being a manager," and that was incredibly challenging to go and truly make that transition from being an IC to a manager.

Justyna was going through the exact same journey, so we would clash at times because I was like, "Stop lurking."

When you're a really strong IC or really strong individual contributor, you know you can go fix the problem, but you have to be able to delegate that to other people. That's just a skill you have to get comfortable with and learn.

So I joined the team and there were certain things that because of my expertise, I knew very well and I was able to drive certain things, but going and empowering my team and give them responsible areas of ownership.

At the time, I was the owner of the Spark platform integration team is what we call, so if you think about things like admin governance, like the monitoring solution for Spark, just the admin settings that you see in the portal. When you buy a capacity, what does that translate to when it comes to a Spark cluster, things like that.

I had a team of three or four people, little over a year. And then the summer of last year they combined a team with mine. So now I have a group of eight reports and I own the data engineering space.

Justin Mannhardt (00:13:44):

Chris Finlan (00:13:45):
You think about things like Lakehouse, monitoring, migration, hardware, deployments for Fabric. It's such a broad area of ownership. Spark runtime, all the things you do there. Delta, that all falls under me.

That's a lot of very complex areas run by very smart individuals and my job is not to get in their way. My job is to make them successful and let the experts go and do the things that they're good at. While I'm ostensibly the owner of those things, I specifically don't call myself the expert. I have really smart individual contributors that I want to make sure that they're successful and so that's my job is to go and help them as best I can.

Now of course there are certain things just because of my experience in the areas around billing and that whole part of the business that I still stay relatively hands-on with.

But one of the things that's been frustrating for me is just I don't want to feel like a dummy in a meeting. It happens. Somebody could ask me a question on the Lakehouse and am I better than I was a year ago? Absolutely. Am I nearly as deep as the people on my team? No. That's what I'm doing. That was a long answer to a long question.

Rob Collie (00:14:49):
That was an awesome answer. It's a long answer, but packed.

Chris Finlan (00:14:55):
I'm a PM. [inaudible 00:14:56]-

Rob Collie (00:14:56):
That's good stuff.

A few things jumped out at me from that. Number one, the place you're at right now sounds structurally similar to the place I was when I was at the end when I was on Excel and it's like the era where I look back and I say, "That was when I was best as a manager."

Chris Finlan (00:15:12):
Were you a people manager? I didn't know you were a people manager.

Rob Collie (00:15:15):
I was. So it's a long story, but the pinnacle of it was when I worked on Excel with the BI emphasis, I had people like Allan Folting reporting to me and he was the one overwhelmingly who revamped the Pivot Table experience with me in a support role.

Howie Dickerman reporting to me doing cube formulas in Excel.

Chris Finlan (00:15:37):
Still can't find them.

Rob Collie (00:15:38):
Yeah. Hidden behind that button that I named and decided to keep it safely tucked away from humanity forever.

Chris Finlan (00:15:46):
Mission accomplished.

Justin Mannhardt (00:15:48):
Protect the mortals.

Rob Collie (00:15:49):
It's gatekeeping, is what it really is.

Chris Finlan (00:15:53):
It's like a velvet rope in Excel.

Rob Collie (00:15:55):
Do you even cube formula, huh? Yeah, I bet you don't even cube formula. Yeah.

Well, I'd tell you where to find the button, but you wouldn't know how to use it.

Justin Mannhardt (00:16:02):
I always thought cube formulas were a specific sector of mathematics that I didn't understand.

Chris Finlan (00:16:07):
They might as well be.

Rob Collie (00:16:07):

Chris Finlan (00:16:07):
[inaudible 00:16:11].

Rob Collie (00:16:14):
Yeah. It's no lambda functions. Okay? Put it that way.

Chris Finlan (00:16:17):
Well, to be fair, the first time I discovered cube formulas, I think it was reading one of your books and I was like, "Oh my God, you could do this? Why isn't this more obvious?" Then I was like, "Oh, I met you." And then I was like, "Well, here we go."

Rob Collie (00:16:28):
Yeah. Now you know why.

You were in a competitive situation where a customer was saying, "I need QlikView to do X, Y, Z." And when you discovered cube formulas, you're like, "No, you don't. Check it out." It was a game changer and I'm like, "Yeah. I got to keep some power in reserve here."

The one time as a manager at Microsoft that I really put the IC part of my job aside and tried to empower and support the people who were truly doing amazing work, I found it somewhat unsatisfying in the end. The imposter syndrome started to kick in as a result because I wasn't doing anything.

That role at Microsoft is one of the most important things that I did as part of my resume. That's something that I need to bring up. Part of my bona fides. And at the same time, I'm like, "Well, I oversaw the overhaul of Pivot Tables," which is a huge thing. Allan really did. So it's tough.

Chris Finlan (00:17:23):
It is. And it's interesting because I'm certainly a unique individual in terms of how I approach things. Nobody says, "I want a second Chris Finlan." They like having one but they don't want a second one.

Rob Collie (00:17:32):
Let's not get carried away.

Chris Finlan (00:17:34):
They might not even like the one.

But for the first few months that I was in my role, I specifically didn't attend any meetings that my team was driving because what happens is if I attend the meeting, I've immediately disempowered them, not because I want to, but because they're like, "Oh, I'm going to talk to the person who's the highest ranking person on the team there."

And so it is a very weird feeling because you're like, "I don't feel like I'm doing anything," because what you're doing is very different than what you do as a high-performing IC. You get to a certain point as an individual contributor. It's like, "Okay, well, we're going to make you a manager now because you've hit the limit of the impact you can make as an IC. So you move into management." And some people don't make that transition very well.

And it really was breaking myself of those old habits and it's a different type of accomplishment. Now I get really excited when I see the things my team are able to accomplish and where I can go and best make them effective. It really does bring me a different level of satisfaction than it would've say two, three years ago. And again, that was that transition from being an individual contributor to this.

And it's hard. It really is. I still enjoy getting my hands dirty at times and jumping in and doing things, but if I tried to do all that stuff, I wouldn't be an effective manager.

Eight people is a lot of people to have directly reporting to you. So all the different areas of ownership and all their different personalities and they want to be successful, they want to grow in their career. How do I get them to the next level? How do I make sure the work's divvied up properly? How do I give them opportunities to be successful? How do I give them proper coaching?

It's really hard, and that's something that I was not necessarily very good at. Just organizing myself. I asked somebody during their one-on-ones recently who wanted to be a manager. I was like, "How would you manage yourself?" He was like, "Oh, that's a good question."

Rob Collie (00:19:26):
Clearly you need to run your team like mafia style. If you want to move up, you've got to take out the big man.

I'm standing in your way.

Justin Mannhardt (00:19:35):
I see the path clearly now, Rob.

Chris Finlan (00:19:37):
It's funny how different style Justyna and I have because she hates, hates, hates, hates being the bad guy. She's one of the nicest people you'll ever meet, so she'll bring me into a meeting to be the bad guy. I'm like, "I really should charge extra."

Rob Collie (00:19:53):
That was the Excel team's job in Office by the way.

Chris Finlan (00:19:55):
Was it really?

Rob Collie (00:19:56):
There was this expected equilibrium where if there was a bad idea that was gaining momentum in Office, like Word, PowerPoint, they'd all just be sitting back like, "Seems like a good idea." Privately behind the scenes saying, "I don't know about this." And the Excel team would eventually be the one to say, "Yeah, this doesn't work."

Chris Finlan (00:20:18):
That is not a great position to be in because eventually you get branded with you're a troublemaker or you're just a contrarian.

Rob Collie (00:20:25):
And you think about it totally makes sense it's the Excel team. The data-driven team, the one that's doing the analysis, that mindset. Think about the average analyst trying to speak truth to power and generally being resented for it around the world, everywhere. That was happening within the Office org itself.

Chris Finlan (00:20:48):

Rob Collie (00:20:48):
So there's one other thing that I really wanted to amplify in your story there, in the course of bringing paginated reporting into the Power BI product. Before your team's efforts with paginated reporting integration into the Power BI SKUs, there wasn't any way to buy it or use it through the same channels as buying Power BI. There wasn't a flavor of Power BI you could buy that was Power BI with paginated reporting.

And so like you mentioned, a lot of those mundane things end up being a tremendous amount of the work. How can we make it so that it's viable and what those SKUs look like? What are the options on the menu? Probably need to add some. How much should they cost is a tremendously difficult question.

And by the way, when I was there at Microsoft, we the engineers never had to have any involvement in that because the cost of goods sold was basically zero. We weren't running cloud operations. Microsoft didn't own any hardware to run the stuff.

Chris Finlan (00:21:50):
No, that's a great point.

Rob Collie (00:21:51):
But it's an engineering problem now to know how much it costs and also by the way, to engineer in a way that it minimizes the cost to Microsoft. You can build the software in a way that costs a lot more to run. So engineering now has to be involved in setting the price.

Chris Finlan (00:22:06):

Rob Collie (00:22:07):
And so for a reasonably long period of time there, from the outside, Chris Finlan and licensing were almost synonymous. You had a lot to do with licensing.

Chris Finlan (00:22:19):

Rob Collie (00:22:19):
They had gone and figured out what the licensing and pricing and all that should be for Power BI, but this is a completely different workload and so even if we just zoom in for a moment, we can call it licensing, we can call it billing, but it's also pricing and that is not a purely like MBA type of exercise, which is what it was for me when I was building software back in the day. Just give it to the product management team, they'd figure out what the market would bear, and the cost of goods sold was essentially zero. They knew how much it cost to manufacture the damn DVDs.

Chris Finlan (00:22:52):

Rob Collie (00:22:52):
The DVDs weren't heavier if they were loaded with more valuable products.

So when it comes time to take another non-Power BI workload, regardless of what it is, the process of figuring out how much to charge for it is such a well-worn path for you at that point that it makes sense for you to have an important role in that.

That just really clicked for me. I didn't get it. The thing you're doing, the Spark stuff and everything, is/was as foreign to you as it would've been to me, but it totally makes sense for you to be central in that story. Even if we just focus for a moment just on pricing, just on licensing, which I know is only one of the mundane things, but it's one of the most visible examples of that for the listeners.

Chris Finlan (00:23:38):
Yeah. That's absolutely the case.

And it's funny you mentioned COGS because if you'd asked me five, six years ago, "You have to think about COGS in a way," I was like, "Why do I have to worry about that as a product manager?"

Rob Collie (00:23:50):
"I'm just software."

Chris Finlan (00:23:52):
And maybe the engineers have to worry about that because the MVP community does associate me very specifically with the Premium Per User SKU with Power BI because I ran that work stream for almost two years.

Rob Collie (00:24:05):
Wow. If we were going to introduce you in the ring, would you be like Chris-

Justin Mannhardt (00:24:08):

Rob Collie (00:24:14):
... PPU Premium.

Chris Finlan (00:24:15):
I don't think beyond the MVP community they would necessarily... I've certainly done podcasts and other things where I've talked about that, but I think generally paginated reports considering I still get mails from people at Microsoft asking me to jump into a paginated reports thing because they don't know I've switched roles. I left in November 2021. Yeah, I still get mail. So I assume that that would be the first thing on my tombstone.

But if you think about having to understand what the COGS are and how that affects the price because you have to understand the margin.

Let's bring it back to the Spark world, so I'm sure at least one of you on here is familiar with the starter pools.

Rob Collie (00:24:48):
That person's name is Justin.

Chris Finlan (00:24:50):
Everybody loves starter pools except me because I look at the COGS bill and then...

Justin Mannhardt (00:24:54):
Right. There it is. We love that. It's so fast though, Chris.

Chris Finlan (00:24:59):
And that's the thing. Ultimately, my job is to make sure that works for us in terms of we're able to provide that service and still be a profitable business. So that's something where it's not a secret that like, "Hey, you've got all these machines all over the world sitting there ready to be used." There's going to be costs associated with that.

So how do you best manage those costs? What are the things you can do in an engineering perspective to make sure that we don't have too many machines there? How do you go and estimate those things?

That's not something... When you're going through the PM interview loop, you spend a lot of time going into.

Justin Mannhardt (00:25:31):
You think about the complexity of the pricing model. If I just overly simplify what's happened, a lot of the workloads that were in Azure Synapse among other things, they've come over and those workloads are now represented in Fabric in some shape or form. And it's a provisioned pricing model. There's an F2 through, what's the big one? 2064 or something, right?

Chris Finlan (00:25:50):

Justin Mannhardt (00:25:51):
Yeah, 2048.

You pick one of these and so you not only had to think about, "Well, how the heck are customers going to actually use these different things?" Because you could pick any one of those SKUs and have wildly different consumption patterns from different workloads from any given customer, then for customers to figure out what's the right thing for me based on what I'm trying to do. This is not an easy thing to crack.

I'm just curious, have you seen anything in the process where you're like, "Oh, we didn't expect that?"

Chris Finlan (00:26:17):
Yeah. I came over from the Power BI world to Spark and Synapse. I look at my very first spec that I wrote when I came over and I'm like, "Wow, this is a piece of crap," because I didn't know what the hell I was talking about, but I wanted to get something down on paper to get my thoughts, and it was like, "Well, you have to manage everything at the capacity level." There is no concept of workspace level management for these things.

Justin Mannhardt (00:26:37):

Chris Finlan (00:26:38):
And interestingly enough in Synapse, the idea of a workspace is very different than what it is in Fabric. It's night and day.

In Fabric, you have workspaces all over the place. In Synapse, you don't have more than a couple of workspaces generally.

But in any event, I was like, "Oh, you have to manage all this stuff at the capacity level and the workspaces all just feed off this." They're like, "That won't work." What do you mean it won't work? We're bringing all this stuff into Power BI. That's how it has to work. And I was told very clearly, "Nope."

And Justyna went through the same thing. Anybody who came over from the Power BI world had to go and figure that out.

And with Spark especially, it's a serverless offering. It's the one workload in Fabric where you go and you provision your compute, you have full control over your ability to go and provision Spark pools.

That was a very deliberate decision and that was at the time fairly contentious, but I think ultimately it was the right one because it was very clear that that's what the customers were asking for and what we needed to do.

And then how do you translate that to a capacity? It's actually fairly easy for people with Spark to say, "Oh, it's two cores per CU." It directly translates to... That simple. So for people who are looking at it coming from the Synapse world, it's very straightforward.
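To make that arithmetic concrete, here is a minimal sketch of the "two cores per CU" translation. It assumes the number in a SKU name (F2 through F2048) equals its capacity units, and the function names are made up for illustration. It covers only the base figure and leaves out the bursting and smoothing that come up later in the conversation.

```python
# Sketch: translate a Fabric capacity SKU into base Spark cores using the
# "two cores per CU" rule quoted above.
# Assumption: the number in the SKU name (F2 ... F2048) is its capacity units.

SPARK_CORES_PER_CU = 2  # from the conversation: "two cores per CU"


def capacity_units(sku: str) -> int:
    """Parse a SKU name such as 'F64' into its capacity-unit count."""
    name = sku.strip().upper()
    if not name.startswith("F") or not name[1:].isdigit():
        raise ValueError(f"unrecognized SKU name: {sku!r}")
    return int(name[1:])


def base_spark_cores(sku: str) -> int:
    """Base Spark cores for a SKU, before any bursting is applied."""
    return capacity_units(sku) * SPARK_CORES_PER_CU


if __name__ == "__main__":
    for sku in ("F2", "F64", "F2048"):
        print(sku, capacity_units(sku), base_spark_cores(sku))
```

Under these assumptions an F64 works out to 128 base Spark cores.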

Now, obviously one of the beauties of Fabric is you can do all these different workloads and they don't have the same luxury in terms of being able to do that direct translation.

But for us, this was a complex problem and you have the ability to manage these pools in a workspace and you'll have the ability soon to manage them at a capacity level so that all the workspaces have to use the same pool configuration. And then what does that mean in terms of the number of concurrent sessions and how does that fit into a capacity and with bursting and smoothing and all the other things with billing. These are very, very complex problems that took a long time to work through.

I'm very proud of the work the team did. I'm not taking the credit for this. I have very, very smart PMs who went and had to actually get into the weeds on this stuff and figure out the nuts and bolts of making this work, but I'm very proud of the work that the Spark team's done to bring those two worlds together.

And there's still a lot of work we need to do, but it's an interesting concept for people to wrap their head around in terms of what exactly that means because I'm sure the people on here know that this is an evolution from Gen 1 Power BI back in 2017 with a blog post from Kamal basically saying like, "Hey, you're getting this capacity that you can now share across the organization and pay this one price. And it kind of works like SQL in terms of the licensing so you just pay for cores as opposed to users."

Rob Collie (00:29:06):
Before we get too much deeper, there's a boiling down of the thing we just talked about that I think would help me and probably also some percentage of the listeners, because you come from two products that I think are much easier for people to understand. People who are listening to this show are much more likely to have come up through a Power BI and/or maybe even paginated reporting background than they are to have a Spark background. And same with you, right?

Chris Finlan (00:29:34):

Rob Collie (00:29:35):
So you are someone who has walked and is walking that same path, and so people on here listening, they're like, "What the hell is Spark?" It's like the other side of Fabric. So at a really, really high level view, Fabric is trying to unite these two completely different communities, these two different types of data work that have been going on in separate ecosystems forever.

I could live in my Power BI world insulated forever and never have to even hear about... The whole term data engineering came up largely behind those workloads, not behind the Power BI workload. Now, we do have data engineers here. The role and the expertise is still highly relevant. How does the other half live, right?

Justin Mannhardt (00:30:29):
Tell us about your customs.

Rob Collie (00:30:30):
Take us to your leader.

This vast community of data science and AI and all of that that is spoken in completely different languages, and Microsoft is very explicitly trying to tear down the wall between those two. And one of the things that they did in that process, as a very tiny, tiny part of it, is they sent you over that wall. So report back to us, help us, our listeners, who have very much your prior background, make sense of this new space that you now find yourself in, in the same way as you are the world's foremost expert on Chris Finlan's journey.

Chris Finlan (00:31:06):
Oh, okay. I thought you were going to say paginated reports. I'm like, "You're killing me here."

Justin Mannhardt (00:31:10):
Rob, I prepared lots of questions about paginated reports. When are we getting to those?

Rob Collie (00:31:15):
Oh, yeah. When are you going to fix this one particular...

Chris Finlan (00:31:19):
I'll pass it along.

Justin Mannhardt (00:31:20):
What does this other world look like, Chris?

Chris Finlan (00:31:23):
Spark ostensibly is for big data usage. It's for when you talk about the medallion architecture and bringing in large volumes of data and doing data transformation and making it available.

Bogdan Crivat who you know, he and I will often have discussions. He's like, "Nobody just uses Spark for the sake of using Spark." Eventually, you want to show the data. Nobody just buys Spark and like, "That's all I need. Just Spark." And then I just do some transformations in there. It sits.

I don't want to give you too much credit, but I think you once made the case that all data is big data until you have to get it boiled down to something that's consumable by humans.

Rob Collie (00:31:59):
Yeah, one screen. If it's more than one screen, it's big.

Chris Finlan (00:32:02):
That's the type of thing where you need to be able to consume this. So it makes sense to bring these worlds together. And when you think about the concept of the Lakehouse and the ability to use either engine that you're comfortable with to serve and consume your data, that's one of the things where you can bring the data in, do the data processing or transformations with Spark, and then actually use the SQL endpoint.
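As a rough illustration of that split (transform in code, then consume with SQL), here is a small sketch that runs anywhere. It is not Fabric or Spark code: Python's built-in sqlite3 module stands in for the Lakehouse SQL endpoint, a plain function stands in for the Spark transformation, and the table and column names are invented.

```python
import sqlite3

# Raw rows as they might land in the lake: messy types and one bad record.
raw_orders = [
    {"order_id": "1", "region": " east ", "amount": "100.50"},
    {"order_id": "2", "region": "West", "amount": "not-a-number"},
    {"order_id": "3", "region": "EAST", "amount": "20"},
]


def clean(rows):
    """Transformation step (the part Spark would do at scale): fix types, drop bad rows."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip records whose amount cannot be parsed
        cleaned.append((int(row["order_id"]), row["region"].strip().lower(), amount))
    return cleaned


def load_and_query(rows):
    """Park the cleaned rows in a table, then consume them with plain SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in for the Lakehouse SQL endpoint
    conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return result


totals = load_and_query(clean(raw_orders))
print(totals)
```

The point of the sketch is only the shape: whoever is comfortable in code handles the cleanup, and whoever is comfortable in SQL never has to open the notebook.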

I'm much more comfortable in the SQL world. That is where my bread and butter is. So for me, it's much more comfortable for me to go in that direction than to go into a notebook and then start doing these things.

I think one of the biggest things... And I can't emphasize this enough, I'm a huge, huge, huge, huge AI guy. I have a workspace subscription to ChatGPT, which you need two workers, and I just am one person, but I don't care. That way I get higher limits to ask questions and stuff. I use the API directly. And I use it specifically to help me write Python and do these things in terms of data transformation.

At one point last year, if you look at my blog, which is, I think every day for 40 some days, I did a blog generated by ChatGPT.

Justin Mannhardt (00:33:15):
I remember that.

Chris Finlan (00:33:16):
Some of those articles, like connecting to the OpenAI endpoint and using it for Power BI, I just had ChatGPT-4 spit that out and I get people asking me questions about that all the time.

I'm very clear in my blog posts: by the way, it was written using this. But for me, that's what makes this...

And I think that's what'll help bridge the gap for so many people because if you suddenly... We were joking before about DAX, but I guess now it's out of the bag.

Rob Collie (00:33:43):
Yeah, yeah. You mean the part where you're a dirty, dirty Scrabble cheater.

Chris Finlan (00:33:50):
How dare you? You still beat me. This is the thing he accused me of cheating and he still won.

Rob Collie (00:33:55):
I did not. I did not win.

Chris Finlan (00:33:58):
We played like 11 times. I think you were 10 and one. You clobbered me in that thing.

Rob Collie (00:34:03):
In reality, I probably thought you were cheating and then I cheated because I don't remember coming away from that thinking that I'd won.

Justin Mannhardt (00:34:11):
There you go.

Chris Finlan (00:34:12):
For the people listening to this, this happened seven years ago.

Rob Collie (00:34:16):
No. It was during COVID.

Chris Finlan (00:34:18):

Rob Collie (00:34:18):
What else were we supposed to do but cheat at Scrabble during COVID?

Chris Finlan (00:34:20):
If you go back to the piece about Spark, if I have to learn a new language, guess what? You've just made this massive barrier for me to go and start using this.

And this is why, much to your chagrin, I'm not a huge fan of DAX and the complexities therein of that beautiful language.

Rob Collie (00:34:42):
All right. Hold on, hold on. I think DAX is a great example because DAX does something that most languages don't.

Chris Finlan (00:34:48):
Frustrates the hell out of people?

Rob Collie (00:34:50):
Oh, come on.

Justin Mannhardt (00:34:52):
This is great.

Rob Collie (00:34:55):
DAX is an all time great language. I will die on that hill, but let's not fight that battle right now.

The point is if you want to build a model that can answer any number of questions at different levels of detail and cross-referencing across multiple different variable sets and things like that, and do it effortlessly once it's built, there aren't many things out there like DAX.

In fact, the only thing out there that I'm aware of like DAX was MDX, which was far, far worse. Whereas this is a question really for either of you. Justin also has his feet planted in both of these worlds.

Chris Finlan (00:35:30):
Well, hold on. I'm not letting you off the hook.

One of the things that the Spark team, the data science team specifically, run by Nellie Gustafsson, my work bestie, built is the semantic link feature.

Rob Collie (00:35:40):

Chris Finlan (00:35:42):
Do you know how popular that is? And do you know one of the reasons why it's so popular, Mr. Collie?

Justin Mannhardt (00:35:48):
Here it comes.

Rob Collie (00:35:49):
Because someone else has to write the DAX.

Chris Finlan (00:35:52):
I don't have to write DAX. I can go and use SQL to go in.

Rob Collie (00:35:55):
But someone else did. Right?

Chris Finlan (00:35:58):
I don't care about them. I care about me.

Rob Collie (00:36:00):
I understand.

But someone did write DAX and the person who's consuming it with semantic link is benefiting from it.

Chris Finlan (00:36:08):

Rob Collie (00:36:09):
But by pointing this out, you are not taking an L in this conversation.

Chris Finlan (00:36:10):
I know I'm not taking an L.

Rob Collie (00:36:16):
I'm not claiming a W either, but you're not getting a win.

Chris Finlan (00:36:19):
It's like our Scrabble game.

Rob Collie (00:36:22):
Someone built something awesome, and the person on the Spark side with semantic link doesn't have to care, right?

Chris Finlan (00:36:29):

Rob Collie (00:36:30):
Just like the person who's using the Power BI report doesn't have to care.

Chris Finlan (00:36:34):
Yeah, no, that's fair. The same thing... You could make the same case with the AI point I was making. Ultimately, there's obviously a ton of work going on behind the scenes that I don't care about. As far as I'm concerned, it's just magic.

Rob Collie (00:36:45):
But I wanted to use the DAX example because if I'm master of all tech in the Power BI ecosystem, under what circumstances would Spark be interesting to me? What business problems might I face where I'd go, "Oh, I might go need to learn how to use Spark to do this?"

First of all, I'm completely aware of the fact that there are many, many things that Spark can do in parallel to things that the Power BI traditional stack... I can't believe I'm already calling Power BI traditional. It's so funny. Power BI is the anti-traditional, it was the disruptor. But anyway.

There are many things that I could do with either tech stack, it's a question of which one I grew up with.

Chris Finlan (00:37:22):

Rob Collie (00:37:23):
There's going to be things that Spark is good at. It's a better choice than the tools that have been available to me in the Power BI ecosystem. And so can we zoom in on some of those?

Chris Finlan (00:37:31):
That's asking the question, and again, I'm wildly oversimplifying this, but if you think about the traditional SQL product, what was SSIS used for?

You're using Spark to go and ingest large amounts of structured and unstructured data, to do data transformations on it, and then make it available to your different data analysts to go in and then report against that.

And again, I'm wildly oversimplifying it, but this concept of Lakehouse is specifically like, "Hey, you've got OneLake, which is under the covers where you're putting all your data, and then this Lakehouse is the way that you can go and then surface and manage this, using a notebook, doing any number of things to go and use the data the way that you see fit."

So you know Mike Carlo?

Rob Collie (00:38:09):
I do. Power BI Tips.

Chris Finlan (00:38:11):
Yes, I love Mike because he's the person whose foot is planted, interestingly enough, squarely in the Power BI world and squarely in the Spark world. He's one of those unique people who's strong in both, and he taught himself this and he uses Databricks, which I'll not hold against him too much, but he's come around on the Synapse Spark and now Fabric Spark experience.

And I look at him in terms of how he uses this and his journey where he needed to go and think beyond the traditional steps a Power BI analyst would use to go and ingest and transform data because you're talking about smaller amounts of data there, but when you're talking about large amounts of data and you're talking about all the flexibility that the notebook gives you to go and do all these amazing things.

I can do enough in a notebook now to be dangerous like, "Oh, man, I actually wrote some Python and was able to go and do some of this stuff." And to me that was magical because I was like, "I have so much flexibility here on this blank canvas to go do these things."

You have an option. For example, you can use a notebook to do that, where it's a fairly code-heavy experience, or you could go use a data pipeline and do it that way, which is not a code-heavy experience.

And again, the great thing about Spark is you can go and spin up the amount of compute that you need to go and transform this data and to bring it in and it's completely serverless. You're only going and provisioning what you need at any particular time. And the compute is completely separated from the storage. So once the data's in OneLake, you then have any number of engines you can go and transform the data with. And Spark is just one of those that you can use.

You talk about on the data engineering side, I mentioned the data science side, I'm jealous oftentimes of the stuff.

Spark in general is used mainly for data engineering scenarios on a day-to-day basis. There are more data engineers, I think, than data scientists. Controversial?

Justin Mannhardt (00:40:00):
No. That's right.

Chris Finlan (00:40:01):
But I look at the stuff that Nellie and team get to do around data science and AI, and I'm like, "That's the magical stuff."

Rob Collie (00:40:07):
That's the faucet.

Chris Finlan (00:40:08):
That's the beauty of Spark. You have this canvas basically where you can do both and some people would argue like, "Oh, you don't need to use Spark to use Python or something." This is the type of thing where you have this flexibility to use these different languages or to do in a notebook all of these unique things. I went and connected to the OpenAI API.

I used to call SSIS like the get out of jail free card because I could basically just do whatever I needed to do to get the data where I needed to consume it in my report, and that's what I look at Spark in terms of I need to go and ingest a large amount of data and then depending on the use case, I can basically accomplish whatever I need to accomplish using a notebook if I want to go down that route.

Justin Mannhardt (00:40:43):
If you're in that Power BI camp and you're doing things in Power Query, for example, if your data model's refreshing just fine and you're cranking cool numbers and awesome insights and you're just fine, there's no FOMO happening because you're not using Spark in these scenarios.

But I remember a project I worked on, it had to do with optimizing the distribution routes of inventory in a supply chain network. And so to get the fact table correct for this analysis, recursion was necessary and it was never certain how much recursion would be needed. Eventually got there in Power Query, but this thing chunked. It was never going to be successful in a production environment. I'm out there in the advanced editor, I'm writing all this code. I'm like, "Look at me. I can write M code and do it," but it's like it wasn't going to work.

And I came back around to a similar problem a few years later in Python, in PySpark. There are applications where, whether it's the volume of data or the complexity of the preparation you need to perform to get to the right fact table, a tool like this in Spark is going to be better suited for the job.
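For readers wondering what "never certain how much recursion would be needed" looks like in code, here is a minimal sketch in plain Python. It is not the project Justin describes. The site names and the expand_routes function are hypothetical, and a Spark version would typically express each pass as a join between DataFrames rather than a set comprehension. The idea is to keep expanding one level at a time and stop when a pass finds nothing new, so the depth never has to be known up front.

```python
# Hypothetical route legs: (from_site, to_site). Names are made up for illustration.
legs = [
    ("plant", "dc_north"),
    ("plant", "dc_south"),
    ("dc_north", "store_1"),
    ("dc_south", "store_2"),
    ("store_2", "kiosk_9"),
]


def expand_routes(legs, origin):
    """Return {site: hops from origin}, expanding one level per pass.

    The loop stops when a pass discovers no new sites, so the depth never
    has to be known ahead of time. Sites already seen are skipped, which
    also keeps circular routes from looping forever.
    """
    hops = {origin: 0}
    frontier = {origin}
    depth = 0
    while frontier:
        depth += 1
        # "Join" the current frontier to the legs table to find the next level.
        next_sites = {dst for src, dst in legs if src in frontier and dst not in hops}
        for site in next_sites:
            hops[site] = depth
        frontier = next_sites
    return hops


routes = expand_routes(legs, "plant")
print(routes)
```

A loop like this is easy to write in a notebook and awkward to write in a tool built around a fixed sequence of applied steps, which is roughly the gap being described.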

It's hard to crystallize it in business value terms sometimes, but the most underrated benefit for me with Fabric is the opportunity for people to move into these other workloads. And you were mentioning, Chris, like AI, assisting you with learning Python and things like that, and the elimination of all these other barriers. I think it's easy to forget if I wanted to jump from Power BI over to something like Python, I was not only going to be in a completely different tool set, but I probably had some sort of infrastructural barrier now going on as well.

Chris Finlan (00:42:23):
That's a great point in terms of the Fabric value proposition of what you had to do before. If you think about Databricks or Synapse Spark, you had to run an entirely different system. You had to go and manage your compute separately. You have to go and tie these things together and make it effortless.

And you saw just at Build this week, there's been a number of announcements around the stuff with Databricks being integrated more closely into the Fabric ecosystem, and now with Snowflake being integrated more closely, that ease of use or ease of adoption and just having a single platform to do all this I think that that to me is something I was candidly very skeptical that this would work or make sense.

Before I joined the team, I was like, "Really? They're going to go and try and put all that stuff together in Power BI?" It wasn't called Fabric at the time.

Rob Collie (00:43:09):
Isn't that the best experience though, when you're skeptical of something and then it works?

Chris Finlan (00:43:13):

Rob Collie (00:43:13):
Some of my favorite experiences in life is having my skepticism dispelled.

Chris Finlan (00:43:18):
Yeah. When I asked this one customer that was really important to bring out this scruffy consultant from Indianapolis to go and do their data model.

Justin Mannhardt (00:43:28):
That's the first time I've heard Rob described as scruffy.

Chris Finlan (00:43:32):
At the time, he was.

Rob Collie (00:43:33):
I'm going to give you a Han Solo line. "Who's scruffy looking?"

And by the way, I was from Cleveland at the time.

Chris Finlan (00:43:41):
Yeah, that's true.

Rob Collie (00:43:41):
Even scruffier.

Chris Finlan (00:43:42):
That's true.

I think one of the hardest things for, and I'm sure you experienced this Rob in the product team, is because you see how the sausage is made, you see all the flaws when you go and you use the product and you're like, "Oh, God, why is it like this? Or why does it do this?" And I know why. I have answers to a number of things like why it does certain things.

And you have to remember this on the product side, and it's something that being in the field and being somebody, when I was at SAP before I joined Microsoft, who was the Microsoft guy running around using Microsoft tools, people bet their careers on the things we do on the product team. Truly, they're like, "This is my career to support my family and to give myself a good life based on what you guys are building."

There's a huge sense of responsibility that I feel when I go and say, "These people are making huge bets on the things that we're doing here. We should be taking this seriously and making really smart decisions around this stuff." And by bringing these things together, it truly is opening this up to a whole new audience.

At the same time, are they all going to move over? No, of course not. It is going to be a journey because for the longest time, I didn't even know what cube formulas were, it took years of me beating my head against the wall trying to do other stuff when it was like, "Oh, there's actually this really great way to do this in this existing product that I already have access to. Hey, let me go try it there."

And I think that's oftentimes what you'll find is you'll try and use your existing tool set up until you're like, "Well, this doesn't work anymore." This was always the joke about Access: you use Excel until you can't. Then you put it in an Access database and you use Access until you can't, and then go into a SQL database.

Justin Mannhardt (00:45:29):
That sounds familiar.

Chris Finlan (00:45:31):
Yeah, that was my journey.

Rob Collie (00:45:34):
It's like career failover.

Chris Finlan (00:45:36):
Yeah, no, I mean it was just... And that's exactly with the Spark stuff to your point...

You're basically echoing what Marco Russo I think wrote in a blog post not that long ago, like, "Hey, why would I go and put my stuff in the lake versus using it in just an imported model."

And for certain use cases, yeah, it's fine to just leave this stuff as it is, but as your use cases get bigger and you're talking about truly big data and the ability to go and just have that data readily available in the lake near real time to go and be able to see that without having to worry about the data set refreshes that you'd have to go and manage and just making it easily available across the organization, that's an incredibly powerful story. That takes time.

One of the things I often forget is... We just went to GA in November. Even though I've been working on this for what? Two and a half years? The product has only been available in a GA state since November. And again, as a PM, I still see all the things that are missing because that's my backlog.

I truly see customers on the verge of going big with this across the board in terms of just seeing their excitement, seeing their use cases, seeing how they can go and bring these things together, and it's not just our workload, but all the workloads. Think about the flexibility that you then have. You can do it because it's all right there.

You don't have to go like, "Oh, I've got to go figure out how I go and bring in a data warehouse to this environment." Guess what? It's just there. It's just part of your Fabric experience. The stuff with Data Activator. I have this available to me now. Wow, this is amazing.

I don't realize sometimes all the different things I get with Office now because I'll go into that waffle and I'm like, "Oh my God, I forgot I even had access to any number of these things in terms of what I can go and use for this."

Microsoft Loop now. I see people using all the time. Who-

Justin Mannhardt (00:47:22):
Oh, yeah.

Rob Collie (00:47:23):
Justin loves him some Loop.

Justin Mannhardt (00:47:24):
I love the Loop.

Chris Finlan (00:47:25):
And I'll tell you what, I was really skeptical of that at the time because it's just the nature of my job.

Rob Collie (00:47:30):
I'm still skeptical.

Chris Finlan (00:47:31):
What a surprise.

Rob Collie (00:47:32):
Justin sends me a loop, I'm, "WTF is this? What is this?"

Justin Mannhardt (00:47:37):
I think we have a shared affinity for OneNote as being a good product, right?

Rob Collie (00:47:41):
Yes, but less so over time. Ever since they bifurcated that product.

Justin Mannhardt (00:47:45):

Rob Collie (00:47:45):
Now I don't know which one's which. I've switched to PowerPoint in OneDrive mode. OneDrive, PowerPoint is like... That's pretty much...

Justin Mannhardt (00:47:54):
That's all Rob needs.

Rob Collie (00:47:56):
Yeah. I'm not an aggressive early adopter of new tech, which makes the whole story about me being an aggressive early adopter of Power Pivot and DAX and everything-

Chris Finlan (00:48:06):
I was about to say-

Rob Collie (00:48:09):
... beautifully ironic, right?

I would start many presentations by saying, "Look, folks, I don't believe in anything. I don't trust in any software. I don't really like software, but I love this stuff," as a means of getting their attention pressing the bull button.

Chris Finlan (00:48:22):
That's a fair point. And I am. I'm the person who will go and want to be one of the first people to go and adopt stuff. It drives my wife absolutely bananas because she just wants stuff to work.

Rob Collie (00:48:32):
Oh, yeah, you've got home automation now and you can't even open the doors. The closet door won't open because it's digitally locked.

Chris Finlan (00:48:38):
I eventually learned my lesson with that. Even for me, it was getting to be too much and I backed off.

Rob Collie (00:48:45):
"Honey, the microwave won't work again." Like, "well, you need to get the panel and press this on the touchscreen."

Chris Finlan (00:48:51):
I think the turning point for me was when I set up Alexa to turn on and off the TV, and I was all proud of myself, and we never used it. I was like, "This was so stupid. Why did I bother spending all that time to figure out how to create this recipe to do this when I never use the damn thing?"

To the point around, you need to be solving a business problem for customers. Just because you create something cool doesn't mean that they'll immediately adopt it.

Justin Mannhardt (00:49:16):

Chris Finlan (00:49:16):
Microsoft in the past has fallen into this trap. I'm sure every software company has certainly done so. It's like, "I have this really cool thing that's searching for a business problem," as opposed to, "There's a business problem. Let's go create a really cool thing to solve it."

Rob Collie (00:49:30):
I want to back us up for a moment. One of our favorite games here is we introduce new segments on the show, and then we only do them once.

Time for everybody's favorite segment, deliberately naïve statement.

Chris Finlan (00:49:45):
This is a new segment. Really?

Rob Collie (00:49:47):
Well, it's the first time we branded it, as...

Chris Finlan (00:49:50):
I see. Okay.

Sponsored by Kellogg.

Rob Collie (00:49:53):
Right. Exactly.

So here's the deliberately naïve statement for reaction from the two of you. Okay, so Spark is just another ETL tool. It's just like Power Query. It's just like SSIS.

Chris Finlan (00:50:06):
It's much more than that. It allows you to accomplish ETL.

Rob Collie (00:50:10):
Okay. That's a very Zen master...

Justin Mannhardt (00:50:12):
I can test my knowledge with the master of ceremonies because I think we talked about this on one of our early pods about Fabric. Spark fundamentally is a compute engine. So on said compute engine, I can do things like ETL. I can design, orchestrate, and manage machine learning models. It is a notebook code-based experience, first and foremost, but it's not strictly ETL.

Chris Finlan (00:50:38):
Yeah. You just ignored all the data science stuff I brought up.

Justin Mannhardt (00:50:41):

Rob Collie (00:50:41):
Can it run any code? Can I slap some C# in there?

Justin Mannhardt (00:50:44):
Why? Why would you do that?

Rob Collie (00:50:46):
Well, no, no, I know, but that's what I'm saying. I clearly can't run VBA on Spark. Right?

Chris Finlan (00:50:51):
Well, that's a huge shortcoming coming from the SSRS world, which still runs VB.

Justin Mannhardt (00:50:58):
I do love me some VB.

Rob Collie (00:51:01):
I'm assuming that there's a specific lane that it operates in, and how do we define that lane?

Chris Finlan (00:51:06):
It supports PySpark, Scala, SparkR, used to support C#.

Rob Collie (00:51:12):
Okay, so let me continue with this naïve exploration because I think this is super valuable to me as a proxy for some of our listeners, I think it's going to be valuable for them as well.

So Spark is a compute engine for running certain kinds of code. I might, for example, use Spark to run some Python, let's say, to perform what I would think of as ETL, to park some data in OneLake, and then subsequently use Spark to spin up some machine learning against the data that's been parked there.

Chris Finlan (00:51:43):
You'd use a programming language or you use a language, but yes.

Rob Collie (00:51:47):

Chris Finlan (00:51:47):
Spark is the underlying engine.

Rob Collie (00:51:49):

Justin Mannhardt (00:51:49):
It'd be like saying, "I'm going to run this VertiPaq measure."

Chris Finlan (00:51:54):

Rob Collie (00:51:54):
Okay. Okay. No, no. So look, look, look.

Chris Finlan (00:51:57):
That would sound a lot cooler.

Rob Collie (00:51:58):
All we're highlighting here is that I didn't have the confidence to pick the language to talk about the machine learning thing. I perform a useful role around here with my deliberate and completely authentic naiveté around these things.

Chris Finlan (00:52:13):
To be perfectly fair, Rob, I was you in late 2021.

Rob Collie (00:52:17):

Chris Finlan (00:52:18):
In terms of my exposure to this world, I would expect the vast majority of people who are listening to this are coming from the Power BI world.

I was at FabCon a few weeks ago, and it was interesting just how early in the journey these folks are, not just from Fabric but from Power BI. We talked about paginated reports several times. Paginated reports was like, "Whoa, whoa, whoa. What is this crazy new technology here with..." Which of course I know you, Rob, would love.

Rob Collie (00:52:44):
What is old is new again.

Chris Finlan (00:52:45):
It's like, "What is this MDX I have to write in this?"

And so that was a good reminder of you and I've often talked about the Redmond bubble and getting outside of that and really connecting with people again. Sure, you have the customer calls, but it's definitely different when you're talking about...

These are oftentimes small consulting firms or people who are working in local state governments and they don't have access to latest and greatest. They don't have access to go and just spin up a new Power BI subscription or an Azure subscription to go do something in Fabric. There are limitations to just going and getting budget to go do things like this. So it's something where you have conversations.

I remember one guy was asking me about, "I didn't even know you could get Office through GoDaddy." This was his question. He's like, "I have a problem with my subscription with Office 365 through GoDaddy where I can't use Power BI." I had never seen that before, and I've been in the field and in product. I'd never seen that issue before.

So just remembering how new this is to people. I think that's a fair point. Now, hopefully they're a little bit smarter than you, but it's a low bar.

Rob Collie (00:53:49):

Well, listen, speaking of which, I was really, really close to learning something there and you all were like, "Quick. Let's get him back off track before he learns something. Let's close the loop."

So Spark is a compute engine that runs certain kinds of programming languages. I might use Spark to run a particular programming language to perform what I would think of as an ETL task, data transformation, and then park the results of that transformed data in OneLake, and then spin up Spark again with a potentially completely different programming language to perform some sort of machine learning task against the data that had previously been so parked. Is that a valid picture?

Chris Finlan (00:54:33):
That is one possible use case.

One of the things you can do obviously is you can import different libraries to use.
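
The picture Rob describes (one job transforms data and parks it, then a separate job picks it up later for a machine learning task) can be sketched in a few lines. This is only an illustration in plain Python, not Spark code: the function names `etl_job` and `ml_job`, the CSV file, and the temporary directory standing in for the lake are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def etl_job(raw_rows, out_path):
    """Stage 1: clean the raw rows and park the result in shared storage."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["units", "revenue"])
        for units, revenue in raw_rows:
            if units is None or revenue is None:
                continue  # drop incomplete records
            writer.writerow([float(units), float(revenue)])


def ml_job(in_path):
    """Stage 2: a separate job reads the parked data and fits revenue = a * units + b."""
    with open(in_path, newline="") as f:
        rows = [(float(r["units"]), float(r["revenue"])) for r in csv.DictReader(f)]
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    variance = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Hypothetical raw data; the temporary directory stands in for the lake.
raw = [(1, 10), (2, 20), (None, 5), (3, 30)]
with tempfile.TemporaryDirectory() as lake:
    parked = Path(lake) / "sales_clean.csv"
    etl_job(raw, parked)                 # first compute session
    slope, intercept = ml_job(parked)    # later, independent compute session
```

The two stages share nothing except the parked file, which is why each one could run on its own compute, in its own language, at a different time.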

Rob Collie (00:54:40):
Yes. And there are many good libraries.

Chris Finlan (00:54:41):
I know you're going to go in the direction of a library joke.

Rob Collie (00:54:43):
We have extensive files, right? The Terminator and T2.

Chris Finlan (00:54:47):
I don't mean to get off topic, but I do want to bring this up.

So my son now is 15, so I've been introducing him to the Arnold Schwarzenegger films and we started with Running Man. We started Running Man, then Commando. Now we're moving to Terminator.

He liked Running Man quite a bit. He thought Commando... He laughed at it quite a bit.

Rob Collie (00:55:01):
It's pretty over the top.

Chris Finlan (00:55:02):
I love Commando because it is over the top, but again, I'm a product of that era as opposed to my son. He really liked Running Man.

Rob Collie (00:55:08):
So you'll get to the only really truly good movie in all of that, which is Predator.

Justin Mannhardt (00:55:13):
I thought we were going to say Last Action Hero.

Rob Collie (00:55:15):
Oh, no, no, no. Last Action Hero is-

Chris Finlan (00:55:16):
Stop. Come on.

Rob Collie (00:55:17):
Last Action Hero is literally the peak.

Chris Finlan (00:55:20):
Please stop both of you. I hate this revisionist history around Last Action Hero. It is garbage.

Rob Collie (00:55:26):
No, it is not.

Chris Finlan (00:55:27):
How dare you.

Rob Collie (00:55:27):
I loved it.

Chris Finlan (00:55:28):
Of course you did.

Rob Collie (00:55:28):
I owned it on VHS.

Chris Finlan (00:55:30):
Of course you did.

Rob Collie (00:55:32):
Talk about a hipster statement. I owned it on tape.

Chris Finlan (00:55:35):
Oh my God. I knew that was going to take us off-topic.

It's interesting in terms of the libraries. I'm sure you're familiar with mim on Twitter who's now a Microsoft employee in fact.

And one of the things is he talks about he loves DuckDB and he'll go and import the library to use DuckDB, and he runs a Spark session to basically go and use that in his notebook.

And Spark is very, very powerful and he'll actually argue to go and just do things with Python or some of these other things.

I don't really need Spark necessarily to do some of this stuff when I'm not working with massive amounts of data. It could be seen as overkill. And-

Rob Collie (00:56:08):
Why would it be overkill? Why would I use Python without Spark versus with Spark?

Chris Finlan (00:56:12):
A lot of it's cost. You have to remember.

So again, going back to why I was originally brought over here, one of the things with Synapse Spark is that you need to have at least three nodes. So there's different size machines you can use.

Rob Collie (00:56:22):
For the listener's benefit, either of you can answer this question. If I can use Spark as a compute engine to run something like Python, it's providing me some value in certain circumstances that programming language, Python, for example, exists independent of Spark. It can be run in many different places.

Chris Finlan (00:56:39):
And I think we were talking about it and moving from Hadoop, the distributed computing model, to do truly large amounts of data.

I think a lot of times people in the Power BI world, and you and I would have conversations with customers, so they talk about big data needs and it'd be 10 million records. That's not really what we're talking about when we're talking about the amount of data.

You think about things you do online like Amazon or eBay or something like that. The amount of data that they have behind the scenes for a large distributed model for compute, for cheap compute, because again the idea is that...

The whole thing with Hadoop is you could use legacy hardware. You didn't need to buy these super expensive machines to go and do that.
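
A rough back-of-the-envelope check makes the "10 million records is not big data" point concrete. The sketch below is only an estimate under stated assumptions: the 200 bytes per row, the 16 GB machine, and the 50% headroom factor are made-up illustrative numbers, and the function names are hypothetical.

```python
def estimated_size_gb(row_count, bytes_per_row):
    """Approximate uncompressed size of a table in gigabytes."""
    return row_count * bytes_per_row / 1e9


def fits_on_one_machine(row_count, bytes_per_row, memory_gb, headroom=0.5):
    """True if the data fits within a fraction (headroom) of one machine's memory."""
    return estimated_size_gb(row_count, bytes_per_row) <= memory_gb * headroom


# 10 million records at an assumed 200 bytes per row
small = estimated_size_gb(10_000_000, 200)
# 10 billion records at the same assumed row width
large = estimated_size_gb(10_000_000_000, 200)

small_fits = fits_on_one_machine(10_000_000, 200, memory_gb=16)
large_fits = fits_on_one_machine(10_000_000_000, 200, memory_gb=16)
```

Under these assumptions the 10 million record case is about 2 GB and sits comfortably on a single machine, while the 10 billion record case is about 2,000 GB and is where a distributed engine starts to earn its cost.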

Rob Collie (00:57:20):
If I'm dealing with modest amounts of data, I can spin up a VM, virtual machine, or something and I can run some Python code on there to handle my, let's say ETL need. Right?

Chris Finlan (00:57:34):
Well, you still need an engine under it.

Rob Collie (00:57:35):
I'm sure there's a million runtimes for Python.

Justin Mannhardt (00:57:38):
I don't know that there's a million.

Chris Finlan (00:57:39):
Yeah. Okay.

Justin Mannhardt (00:57:40):

Rob Collie (00:57:43):
A hundred? It's like how many zeros are on it, right? There's only one runtime in the world?

Chris Finlan (00:57:48):
No, no. With Spark, there are runtimes and there's usually four that are at any particular time supported. It's versions of SQL. So you have a support cycle timeframe, but it's usually every couple years that that rolls off. So there's 2.4, 3.1, 3.2, 3.3.

Rob Collie (00:58:03):
I make no secret of the fact that I've never written any Python code.

Chris Finlan (00:58:07):
Don't conflate Python with Spark.

Rob Collie (00:58:09):
That's exactly what I'm trying to do, is I'm trying to tease apart when I would use Spark to run some Python code, for example, versus not use Spark.

Chris Finlan (00:58:17):
This is the argument that some would make is that Spark, I'm not trying to run big data analytics. I don't need to run this across hundreds of nodes.

Rob Collie (00:58:27):
Right. That's what I was getting at.

Chris Finlan (00:58:28):
But I want them to run across hundreds of nodes. That helps me.

Rob Collie (00:58:32):
It helps you, Chris Finlan?

Chris Finlan (00:58:34):
Yeah, because then they're using my product.

Rob Collie (00:58:35):
Yes, I get it. Okay.

Chris Finlan (00:58:37):
It's a very selfish viewpoint, I'll admit.

Rob Collie (00:58:39):
Well, yeah, but here's the thing, because I want it to run in some reasonable period of time. I can't have one computer sitting there chewing on this for years or whatever. Right?

Chris Finlan (00:58:50):

Rob Collie (00:58:50):
I need to run a bunch of things in parallel.

Chris Finlan (00:58:52):

Rob Collie (00:58:53):
I need to scale it up, but only every now and then. It's not always running all the time.

Chris Finlan (00:58:57):

Rob Collie (00:58:58):
That's where the whole serverless thing comes in, right?

Chris Finlan (00:59:00):

Rob Collie (00:59:01):
Is the idea that I can have a workload that can scale up rapidly and grab all kinds of compute resources. And Spark is a framework that helps me do that.

Chris Finlan (00:59:12):

Rob Collie (00:59:12):
Believe it or not, that is something important. I needed to know that.
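
The shape of what Rob just summarized (split the work, run the pieces in parallel, combine the results) can be sketched on a single machine. This is not Spark and makes no claim about how Spark is implemented: threads stand in for worker nodes here, and `split_into_partitions`, `process_partition`, and `run_job` are hypothetical names for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, n_partitions):
    """Deal the records out into n_partitions roughly equal slices."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def process_partition(partition):
    """The per-partition work; here, a sum of squares."""
    return sum(x * x for x in partition)


def run_job(records, n_workers):
    """Run the per-partition work on n_workers workers, then combine the partial results."""
    partitions = split_into_partitions(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    return sum(partial_results)


records = list(range(1000))
one_worker = run_job(records, n_workers=1)
four_workers = run_job(records, n_workers=4)
```

The answer is the same with one worker or four; only the amount of compute grabbed for the job changes. In CPython, threads will not actually speed up this arithmetic, so treat this as a picture of the pattern rather than a benchmark.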

Chris Finlan (00:59:16):
I feel like you already knew that and you were just leading us there.

Rob Collie (00:59:19):
Yes, yes, absolutely. I knew all of that, and I'm just trying to play dumb. Seriously. That was really valuable to me.

Chris Finlan (00:59:24):
I can't tell with you.

Rob Collie (00:59:28):
Wow. That explains a lot. That explains...

Chris Finlan (00:59:34):
I'm sorry.

Justin Mannhardt (00:59:35):
I have a question, Rob.

Rob Collie (00:59:36):
Let's do it.

Justin Mannhardt (00:59:36):
So Chris has shared his trials and tribulations of moving workloads from other places into the Power BI product. And so effectively, Chris is now responsible for one of the things that I could say came from the Azure Synapse world and now is existing in Fabric. There are other things.

Our team at times, they find themselves looking at Fabric versus Synapse and still trying to make decisions like what's the best thing for what our customer's trying to do.

Your backlog and what you're comfortable sharing, what are some of the new capabilities you're most excited about, whether it's bringing parity or going beyond what Synapse was able to do in the data engineering space?

Chris Finlan (01:00:18):
Well, the funny thing is I was literally just looking at our roadmap, which we published this week.

So one of the biggest things is being able to go and connect to a Fabric data warehouse from a notebook. This is something where you could do this in Synapse, but it was actually an open source project. It was maintained by Microsoft to go and do this and bring in your notebook, which is just, they stopped working on it. And so as the Spark versions continued to evolve or move to the next version, bringing support for that was something that didn't happen.

So this is something that we made available and announced this week as well.

Justin Mannhardt (01:00:50):

Chris Finlan (01:00:51):
But in terms of the read capabilities are there first and then the ability to write back is something that's coming in the next few months. So that's something that we're very excited about.

High concurrency is one. So high concurrency is the ability to run and reuse your Spark compute with the same session. So it's really more of a cost saving thing than anything else because one of the things with Spark, you can run untrusted code. So this is, Rob, back to your question earlier.

Another interesting parallel between SSRS and Spark is the security things you have to think about. You have to have a security boundary because you can go run any code you want. You have all that flexibility and how do you protect that?

So in the paginated reports world, we use containers. Right now, with Spark, you get your own cluster of compute, literally VMs spin up for you and go and isolate it that way.

But the thing is, only one user can use that at a time. And one of the challenges was is that each session you would go and get new compute for that. Well, that could multiply very quickly in terms of like, "I'm using eight cores here, eight cores here, eight cores here," and then the ability to kind of go and reuse that compute. It's something called high concurrency, which Databricks has had, and we introduced for notebooks several months ago, and now it's going to be available for pipelines here in the very, very near future.

So again, it's a cost thing and a speed thing in terms of not waiting for new compute to fire up.

One of the reasons that we have starter pools is that normally when you go and connect to your cluster, in the past it would be a two to three-minute wait to go get the machines from Azure, and guess what? People don't like to wait.

Now, me coming from SSRS, I'm like, "Two to three minutes-"

Rob Collie (01:02:17):
"That's nothing."

Chris Finlan (01:02:19):
You go get some coffee.

People are way too anxious.

Now, Rob of course would say anything more than five seconds. You've lost the user.

Rob Collie (01:02:26):
Three seconds now.

Chris Finlan (01:02:28):
Yeah. I think that's the point is having that quick experience. And people got addicted to it very quickly.

It also helps save... If you're thinking about a pipeline, you don't have to go keep spinning up new compute to do each stage. You can reuse the compute for that session. Those are a couple of the things that I'm pretty excited about.
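
The cost and speed argument for reusing a session is simple arithmetic. The sketch below uses the "eight cores here, eight cores here, eight cores here" example and a start-up wait in the two to three-minute range mentioned above. The function name `compute_footprint` and the exact numbers are assumptions for illustration, not figures from the product.

```python
def compute_footprint(n_notebooks, cores_per_session, startup_minutes, shared_session):
    """Cores requested and total start-up wait, with or without a shared session."""
    sessions = 1 if shared_session else n_notebooks
    return {
        "sessions_started": sessions,
        "cores_requested": sessions * cores_per_session,
        "startup_wait_minutes": sessions * startup_minutes,
    }


# Three notebooks, eight cores per session, an assumed 2.5-minute start-up each
separate = compute_footprint(3, 8, 2.5, shared_session=False)
shared = compute_footprint(3, 8, 2.5, shared_session=True)
```

With separate sessions the three notebooks ask for 24 cores and pay the start-up wait three times; sharing one session keeps it at 8 cores and a single wait.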

One of the big announcements we had this week is this new Velox Gluten is the open-source technology that we're using, the native engine acceleration that we've introduced. This idea that we're making Spark run faster through software acceleration and basically, every query that you're doing, you're going to see increased performance with that. And so we announced that this week.

This was a monster effort for the team, to put this into the engine, and using the open-source technology to do this as opposed to some sort of walled garden that's something natively ours. This is something that people should be very excited about. It was talked about briefly at Build, but you'll hear much more about this in the coming weeks and months.

Rob Collie (01:03:20):
I have a curiosity.

Justin Mannhardt (01:03:22):

Rob Collie (01:03:22):
Very, very early in this conversation, you mentioned very briefly the-

Chris Finlan (01:03:26):
I don't remember at this point.

Rob Collie (01:03:28):
Well, it's okay. I wrote it down. External storage memory.

So you mentioned hardware deployments. From a perspective of being at Microsoft, there's no chance this is going to be truly relevant to the ways in which we use this stuff. How close do you ever get in your engineering role, like your product role, to talking about the types of hardware...

Chris Finlan (01:03:49):
Oh, I deal with that very specifically.

Rob Collie (01:03:51):

Chris Finlan (01:03:52):
One of the people on my team, a gentleman named Guy Haycock, is responsible for choosing the hardware that we use, making sure it's available in the Azure regions, making sure we have enough cores available everywhere. That is an entire process. He used to do it for both Synapse and Fabric and now just focuses on Fabric and another team handles the piece around Synapse, but I know exactly the VM types we use.

It's interesting because one of the differences between Synapse and Databricks that we carried over is with Databricks, you actually get to choose the type of VM you use. Now, some would argue like, "Oh, I love having the ability to choose all those VMs." Do you know how many VM types you can choose from? There's...

Justin Mannhardt (01:04:26):

Chris Finlan (01:04:27):
And so that can be very confusing. Again, some people love having all that choice, but oftentimes people are paralyzed like, "I don't know what I need to choose," and at least on the Fabric side, we don't expose that. We have the one VM type.

And in Synapse you had the ability to choose CPU or GPU, so there was a public preview of GPU. I don't think it's a secret that we want to bring that capability to use GPU-based hardware there as well, especially for the data science scenarios. That provides a lot of additional value.

The world's changed a bit in the last few years where GPUs are suddenly very difficult to get. I spend a lot of time looking at that.

Rob Collie (01:05:00):
Poor Sam Altman panhandling for his trillions of dollars.

Brother, can you spare a trillion?

Justin Mannhardt (01:05:07):
Need more chips.

Rob Collie (01:05:08):
Need more compute.

Chris Finlan (01:05:09):
Back to your point around the hardware, it's not just the hardware itself, it's all the networking pieces, all the things like the cache that you need to make available all feeds into the underlying cost of goods sold. That's something that I've had to become very intimately familiar with.

Rob Collie (01:05:24):
Have you ever gone to or known of any place at Microsoft where you go to a hardware vendor and say, "Hey, we need something special cooked up?"

Chris Finlan (01:05:33):
I'm sure people do it.

Rob Collie (01:05:35):
I don't know why that would be so cool, but one of the coolest things I've ever associated with was when we were running our own cloud servers way back in the day, like 2011.

Chris Finlan (01:05:44):
When you were COO of a cloud?

Rob Collie (01:05:46):
That's right. We built our own rack-mounted hardware, but with PC gaming rig motherboards essentially. These things were overclocked because the normal server hardware didn't want to run each core super screaming fast. It just wanted to be slow and steady and distributed, but we needed each core to be screaming fast to run that DAX language that you love so much.

Chris Finlan (01:06:10):
Marco Russo one day is going to knock at my door and just whack me. Mafia analogy.

Rob Collie (01:06:17):
The thing about being Marco Russo is that he doesn't have to.

Chris Finlan (01:06:21):
That's actually very true.

He's one of the smartest people I have ever met, and the fact that I got him to speak to a stuffed bear for 25 minutes is one of the true joys.

Rob Collie (01:06:30):
Maybe he does have to kill you.

Always a pleasure to see you. Always a pleasure to talk to you. This has been a lot of fun.

Chris Finlan (01:06:37):
It has.

Speaker 3 (01:06:38):
Thanks for listening to the Raw Data by P3 Adaptive podcast. Let the experts at P3 Adaptive help your business. Just go to Have a data day.
