Chris D’Agostino, Databricks | Google Cloud Next ’24

[Savannah Peterson]

Welcome to Google Cloud Next

Good afternoon, AI fans, and welcome back to Google Cloud Next here in fabulous Las Vegas, Nevada. My name is Savannah Peterson. It’s day three here on theCUBE. We are over 30 segments into our fantastic programming across three days. John, my co-host and co-founder of theCUBE. What a week for us.

[John]

AI is in everything

That’s been great. I mean, first of all, we’re AI drunk all the time on theCUBE, but this is big-time AI. It’s in everything they’re talking about. It’s got all the winning formulas for theCUBE: speeds and feeds, high-scale performance, embedded AI, and the end-user experience is just awesome. Great event, and this next guest is with a real leading company. Still private, but doing amazing open things in AI and all the data, so it should be a great segment.

[Savannah Peterson]

Introducing Chris D’Agostino

Yeah, absolutely. Let’s welcome Chris to the show. Chris, thank you so much for being here.

[Chris D’Agostino]

Thanks for having me.

[Savannah Peterson]

Has the week been as exciting for you as it has been for us?

[Chris D’Agostino]

It’s been fantastic. This is my first Google Next event personally, so it’s been a lot of fun.

[00:01:04]

Yeah.

Google partnership

[Savannah Peterson]

Did anything … What stood out to you about the experience so far?

[Chris D’Agostino]

I just think the partnership with Google for us has been amazing, and this year in particular, it’s really accelerated. So talking with a lot of the teams from Google, having as much traffic at our booth as we have, and trying to better understand how our platform integrates into Google has been great, and we’ve been very fortunate. We’re the Google Technology Partner of the Year, which we’re very proud of, and we appreciate the partnership.

[John]

Google Technology Partner of the Year

Yeah, so we’re very thrilled. That’s a great accolade from Google, obviously, Partner of the Year, but Thomas Kurian in his keynote on day one really kind of laid it out. With a cloud that has first-party LLMs, third-party LLMs, and open source, you guys have introduced a new LLM, DBRX. It’s got a lot of awesome reviews, number one on Hugging Face. People really liked what’s going on performance-wise, so open source models converging in adoption, speed, and performance with the pre-existing proprietary models is quite an accomplishment.

[00:02:07]

Shows growth and appetite for AI from developers.

DBRX convergence

How’s this converging into the execution for companies that are evaluating it? Because the data lake has been a great strategy: get all the data in there. Now developers are coming in, data as code, we’ve been covering that. This seems like the nexus, where the data is everywhere and AI is everywhere. Where’s the customer’s intersection point? Where does this come together for the customer?

[Chris D’Agostino]

DBRX strategy

Yeah, I think DBRX is kind of the culmination of a decade-long strategy, and let me just go back to the co-founders of the company, research students out of UC Berkeley. When they did the initial investment thesis for Databricks, it was really three big bets, and if you go back 10-plus years, these were pretty bold bets at the time. The first was that organizations were going to be moving to the cloud; second, successful ones were going to be using open source; and third, that machine learning would really be at the forefront. So now you fast forward an entire decade.

[00:03:10]

[Savannah Peterson]

Nice predictions.

[Chris D’Agostino]

Yeah, very wise predictions. I think it’s one of the many reasons why the company has had the success it’s had. But you fast forward 10 years, and we acquired MosaicML, and there was a lot of press around the price of that acquisition, and DBRX, I think, is the first step in the results of that, which is: can you build open models that perform as well as the proprietary ones, and enable customers to do more with their own data in a very secure manner, to either fine-tune or do RAG-style augmentation for Gen AI? And we’re really proud of the results. As you said, it’s the number one open model on Hugging Face. I think we’re seeing new models coming out all the time.

[00:04:00]

We’ve got, through the investment in MosaicML, we’ve got an amazing research team, and we’re going to continue to innovate in that space, for sure.

[John]

Business changes at Databricks

How has your business changed at Databricks just over the past couple of years? I mean, we’ve been following, and I want to get kind of a scoped perspective. Pre-Gen AI, you guys had all the things going on, all those bets were coming in, but then Gen AI kicks things into high gear, and now you’ve got the language model. How’s that changed Databricks’ focus and execution, and ultimately customer benefit?

[Chris D’Agostino]

Gen AI and Databricks strategy

Well, I think it’s actually played into the strategy really well, right? When you look at us versus some of the other vendors in the data space, Databricks was built to be a data analytics platform with machine learning at the forefront, and so we have a world-class SQL engine that enables data warehousing, but the core of it has been: how do you get all the different data personas inside an organization, from data engineering through data analysts, data scientists, machine learning

[00:05:01]

experts, how do you get them to work together as a team, view one copy of the data, and leverage the upstream work by other teams? So you mentioned the data lake, and data lakes have had their pros and cons, and we’re proud of the fact that we pioneered this lakehouse concept, which was: hey, with all the improvements from the hyperscalers in terms of storage, network IO, and things like that, and the ability to spin up containers quickly, we can actually leverage a single copy of the data, store it in the object store, and run all these different use cases, to include large language models.

[John]

AI factory

Yeah, you know, Chris, one of the things is that Jensen, over at the GTC conference, used the "AI factory" term. I remember saying at your event, you’re enabling the AI factory model with the data lake, and now that term’s come out a lot, the AI factory. Look at the metaphor: things are being built with AI and optimized for AI. What does that mean to you? How do you look at this AI factory positioning? How do you frame that?

[00:06:04]

[Chris D’Agostino]

Leveraging AI safely

So organizations are really struggling with how to leverage it. We know it’s a great opportunity, but how do you do it safely? First things first, you’ve got to separate, I think, the externally facing use cases from the internally facing use cases, especially for large enterprises, where they’ve got a large workforce that can provide that human feedback. Because the worst thing you want to do is put an externally facing LLM out there and have it generate some content that’s not accurate, and we’ve got plenty of examples in industry where that’s happened and it’s created some legal challenges. What we’re seeing is a lot of organizations starting with internally facing use cases, building off of these open models, and Databricks is designed to be very flexible: you could run our model, you could build your own model if you wanted to, and the Mosaic engine makes that really affordable, or you

[00:07:00]

could leverage any of these proprietary third-party models through a model gateway that we’ve enabled. So for us, we want organizations to feel like they’ve got choice and are not locked into any particular solution, because we know it’s changing, and we want them to be able to orchestrate the invocation of models that meet the specific needs of a use case.
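
The routing idea Chris describes can be sketched in a few lines of Python. This is a toy illustration only, not the Databricks model gateway API; the backend names and routing table are invented for the example:

```python
# A toy model gateway: route each request to whichever model backend fits
# the use case, so callers are never locked into one provider.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ModelBackend:
    name: str
    invoke: Callable[[str], str]  # prompt -> completion


class ModelGateway:
    def __init__(self) -> None:
        self._routes: Dict[str, ModelBackend] = {}

    def register(self, use_case: str, backend: ModelBackend) -> None:
        self._routes[use_case] = backend

    def complete(self, use_case: str, prompt: str) -> str:
        backend = self._routes.get(use_case)
        if backend is None:
            raise KeyError(f"no model registered for use case {use_case!r}")
        return backend.invoke(prompt)


# Stub backends standing in for an open model and a proprietary one.
gateway = ModelGateway()
gateway.register("internal-qa", ModelBackend("open-model", lambda p: f"[open] {p}"))
gateway.register("code-gen", ModelBackend("vendor-model", lambda p: f"[vendor] {p}"))

print(gateway.complete("internal-qa", "summarize this ticket"))
```

The point of the indirection is the one Chris makes: when a better model appears, only the routing table changes, not the callers.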

[John]

Open formats and interoperability

You know, one of the things I’ve been impressed with Databricks on, especially at last year’s event, was that you introduced open formats, which kind of changes the game, and interoperability is going to be a big part of it. But also, on the intelligence side, we’re seeing more reasoning coming in with AI. So I have to ask you: are we at the peak or at the beginning of true democratization of data? Because you combine open source, open formats, scale, intelligence, and data. Where are we on the progress bar toward, I mean, true democratization?

[Chris D’Agostino]

Democratization of data

Well I think, you know, my personal opinion is, if you look industry wide, I think we’re kind of at the beginning, and the reason for it is because many organizations, these

[00:08:03]

large enterprises in particular, are coming off of these proprietary systems, kind of the vendor lock-in, and I think the technology community has really gotten sort of the wake-up call around, look, you’ve got to be more open. 100%. Now, Databricks, I would say, is at the peak. We’ve been open from the very beginning. The foundational elements of our platform are very popular open source projects, Spark being sort of the key one there, but for us, we look at it from two dimensions.

Openness and performance

One, we want to be as open as possible, and we think the entire ecosystem should be more open. The second is, we want to be the best price-per-performance for a particular use case. So when it comes to the open formats you mentioned for table structures in the lakehouse environment, you know, we have Delta, which is fully open source. We stand by it, especially because of its performance characteristics, but we saw in industry, you know, Iceberg and Hudi and all these other alternatives, and so rather than just stay fixed, we said, you know what, every time we get new data in, Delta will be the default, but we write out the additional metadata for both Hudi and Iceberg.

[00:09:14]

So if you’re a customer that’s got a large Iceberg kind of environment, Databricks will fit into that perfectly.
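
The "one copy of data, many table formats" idea can be sketched as a toy in Python. The structures below are simplified stand-ins for illustration, not Databricks or table-format APIs; in the real systems the per-format metadata is far richer (transaction logs, snapshots, manifests) but points at the same underlying Parquet files:

```python
# Toy illustration: data files are written once, and lightweight metadata is
# emitted for each table format, so Delta, Iceberg, and Hudi readers can all
# discover the same single copy of the data.
from typing import Dict, List


def ingest(table_metadata: Dict[str, List[str]], data_file: str) -> None:
    """Record one new data file in every format's metadata layer."""
    for fmt in ("delta", "iceberg", "hudi"):
        table_metadata.setdefault(fmt, []).append(data_file)


metadata: Dict[str, List[str]] = {}
ingest(metadata, "part-0001.parquet")
ingest(metadata, "part-0002.parquet")

# Every format's reader sees the same files; nothing is duplicated.
assert metadata["delta"] == metadata["iceberg"] == metadata["hudi"]
```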

[John]

Importance of open formats

Well, explain the importance of that. I brought that up first, and I want to unpack it, because I think that is a real nuanced point. Why is that important? Because that enables something, that enables value in the form of, what, interoperability, ease of use, integration, lower barriers? What’s that?

[Chris D’Agostino]

Yeah, absolutely. Interoperability. We don’t want customers to have to think, okay, if I’m using Databricks and I have some third-party system in my environment that only reads Iceberg, why limit them in terms of that integration? We would rather solve that and say, look, we think you’re going to get the best performance out of the Delta format, but under the covers, all of this is just Apache Parquet, and, you know, the differences are metadata and the way in which, you know, certain, you know…

[00:10:05]

[John]

It’s a bold move because you’re basically taking a preemptive strike and saying, let’s not squabble, let’s get open, and everything else advances, so that will accelerate democratization.

[Savannah Peterson]

Trends in AI adoption

Yeah, I mean, it’s stronger together: between the partners, between the open source community as a whole, it’s the whole shebang. You touch a lot of different customers across verticals. Outside of people getting excited and not necessarily having their full strategy iced out, and the fact that we’re seeing a lot of POCs, MVPs rather than actualization, what are some of the trends you’re seeing?

[Chris D’Agostino]

Internal AI use cases

So, again, in that mode of looking internally first at opportunities, you’re seeing LLMs being used quite a bit in call center operations to make those more efficient and to assist. On the one hand, people say, well, AI is going to replace humans in jobs. This is really more of an augmentation play; this is really making those call center agents more effective at their jobs, providing better customer experiences. So we’re doing that.

[00:11:04]

Another one, which is really interesting for us, is the ability to take LLMs, train them on COBOL as a programming language, and convert that code to other programming languages. So it gives organizations the ability to take legacy code and have a machine do a lot of the conversion for them, and have humans validate it and improve it. What a time saver. But, yeah, think about the cost savings associated with migrating legacy platforms.
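
The human-validation step Chris mentions can be sketched as a small harness: run the machine-converted function against input/output pairs captured from the legacy system, and surface only the mismatches for human review. The converted function and test cases below are invented for illustration:

```python
# Flag machine-converted code for human review only where its behavior
# diverges from recorded legacy behavior.
from typing import Callable, List, Tuple


def flag_for_review(converted: Callable[[int], int],
                    legacy_cases: List[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Return (input, expected, actual) for every case the conversion got wrong."""
    mismatches = []
    for inp, expected in legacy_cases:
        actual = converted(inp)
        if actual != expected:
            mismatches.append((inp, expected, actual))
    return mismatches


# Pretend this came out of an LLM-driven COBOL-to-Python conversion.
def converted_interest_cents(balance_cents: int) -> int:
    return balance_cents * 5 // 100  # 5% interest, truncated to whole cents


# Input/output pairs captured from the legacy system.
cases = [(10_000, 500), (0, 0), (199, 9)]
print(flag_for_review(converted_interest_cents, cases))  # [] -> nothing to review
```

An empty result means the conversion matched the legacy behavior on every captured case; anything else goes to a human, which is the augmentation pattern described above.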

[Savannah Peterson]

And it’s everyone’s favorite part of their job, it’s migrating legacy. Sure.

[Chris D’Agostino]

Reading someone else’s old code with no comments and no documentation.

[John]

You know, data is so exciting, we love data, you know, it’s always fun.

Governance and generative AI

But an area we’ve been checking out, and I want to get your thoughts on this, is that it’s not always in the mainstream news, and when you peel back the onion on earnings and even in the private companies, the hot area right now is governance. And can you share your vision on why governance is important more now than ever before with generative AI?

[00:12:02]

What is it about generative AI that makes governance important to nail down now, and how is it being crafted or architected?

[Chris D’Agostino]

Importance of data governance

Yeah, so I’m going to call back to Google’s paper on statistical models for AI: they had a paper called “The Unreasonable Effectiveness of Data,” right? It was this notion that a less sophisticated model trained with a ton of data is going to outperform a more sophisticated model with less data. When you get into neural nets, and LLMs especially, the quality of that data is so critical, because you want the model trained on things that you know will provide the correct answers that your organization is willing to stand by. And so that risk, you know, hallucinations providing the wrong answer, those things all stem from the data that you’ve used to train the model. So governance of data, understanding where it’s come from, the lineage associated with it, the data quality checks, all those things are so important, and that’s why we’ve built lots of capabilities around that inside of our platform.

[00:13:06]

It all surfaces up through Unity Catalog, which is that single pane of glass to really understand what different data assets you’ve got and what is the quality of the data.
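
The lineage side of this can be illustrated with a toy provenance graph: each derived dataset records its upstream sources, so the full ancestry of any asset, including a model’s training set, can be walked back to raw inputs. The dataset names are invented, and this is not the Unity Catalog API:

```python
# Toy lineage graph: each dataset maps to the datasets it was derived from.
from typing import Dict, List, Set

lineage: Dict[str, List[str]] = {
    "raw_events": [],
    "cleaned_events": ["raw_events"],
    "features": ["cleaned_events"],
    "training_set": ["features", "cleaned_events"],
}


def upstream(asset: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every dataset that feeds, directly or indirectly, into `asset`."""
    seen: Set[str] = set()
    stack = list(graph.get(asset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, []))
    return seen


print(sorted(upstream("training_set", lineage)))
```

Walking the graph answers the governance question Chris raises: if a model misbehaves, you can trace exactly which data fed it.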

[John]

And you could also foreclose value if data is not available for governance reasons, or with bad governance, if data is missing.

Best practices for governance

So it begs the question: is there a best practice that you guys see now for developers who want to do the hard stuff with governance and compliance? The performance stuff is all going to get taken care of, but the compliance side has always been, I’d call it not anti-innovation, but something that slows things down a lot in the speed game. How do you see that speeding up on the compliance side so people don’t get held back?

[Chris D’Agostino]

Speeding up compliance

Yeah, well for us, we’re trying to leverage technology to help speed it up. So in addition to creating an environment where people can build and train their own LLMs and other types of machine learning models, we are embedding machine learning inside of the Databricks platform.

[00:14:03]

So as new data comes in, we’ve trained models on the existing data, and as that new data arrives, we’re able to compare and contrast whether or not that new data is consistent. And so we’re able to flag these things on behalf of the enterprise, not just the developers, but maybe the audit and risk folks in, say, a banking scenario.
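
A minimal sketch of that kind of automated consistency check: summary statistics learned from existing data are compared against each new batch, and batches that drift too far are flagged for the audit and risk team. The thresholds and values here are invented for illustration, and real systems use richer tests than a mean comparison:

```python
# Flag incoming data batches whose distribution is inconsistent with the
# profile learned from existing data.
import statistics
from typing import List, Tuple


def fit_profile(values: List[float]) -> Tuple[float, float]:
    """Learn a simple baseline profile (mean, stdev) from existing data."""
    return statistics.mean(values), statistics.stdev(values)


def flag_drift(profile: Tuple[float, float], new_batch: List[float],
               z_threshold: float = 3.0) -> bool:
    """Flag the batch if its mean lies more than z_threshold stdevs from baseline."""
    mean, stdev = profile
    batch_mean = statistics.mean(new_batch)
    return abs(batch_mean - mean) > z_threshold * stdev


baseline = fit_profile([100.0, 102.0, 98.0, 101.0, 99.0])
print(flag_drift(baseline, [100.5, 99.5, 101.0]))   # consistent batch
print(flag_drift(baseline, [250.0, 260.0, 255.0]))  # drifted batch
```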

[Savannah Peterson]

So Databricks has obviously been good at predicting the future, if we’re here now. Two questions for you back-to-back, actually I’ll separate them.

Big moment for ML

Does it feel like a big moment for you now that the whole world’s jumped on the ML front and everyone wants to play with AI?

[Chris D’Agostino]

Databricks’ ground-up approach

Yeah, you know, Databricks’ history and kind of its roots have been more ground-up inside of an organization: finding those data science teams, the ones that are doing a lot of the hard work, and figuring out how to make their lives easier and more enjoyable. And now there’s this massive wave of AI, and the success of the company and, I think, the vision behind the company was to build this platform out and basically have it there at the ready when this wave hit, in a way that doesn’t require organizations to pass their data to some closed-off environment and, you know, run the security risk associated with that.

[00:15:19]

You can do all the things that you’d want to do with large language models and Gen AI inside the platform with your own data, your own security model.

[Savannah Peterson]

Future predictions for Databricks

That’s awesome. All right, last question for you, going to the future prediction side of things. When we interview you next time, what do you hope to be able to say then that you can’t say today?

[Chris D’Agostino]

Let’s see. That’s an interesting question. I think what we’ll see then is Databricks will be a household name the same way that Google and other major data providers are. And I think the world will know more about Databricks as a result of the work that’s being done.

[00:16:01]

[Savannah Peterson]

Databricks as a household name

Love that. We will take that soundbite when it happens and we really look forward to it. Thank you so much for joining us today.

[John]

And we’ll see you at the Data + AI Summit coming up.

[Chris D’Agostino]

You will.

[Savannah Peterson]

Yeah, sure. It’s going to be a whole party. We’ll keep the party going.

Thank you and goodbye

Thanks for being here with us, John. And thank all of you for tuning in to our live coverage here from Las Vegas, Nevada, at Google Cloud Next. My name is Savannah Peterson. You’re watching theCUBE, the leading source for enterprise tech news.