Ep. 5 What do Multimodal AI and Smaller LMs Mean for Enterprises? | AI Insights and Innovation

[David Linthicum]

What Are Multimodal AI and Smaller LMs?

So, what do multimodal AI and smaller language models mean for the enterprise? Let’s talk about it. Welcome to AI Insights and Innovation, your go-to podcast for the latest news, trends, and insights in artificial intelligence, including generative AI.

Introduction of the Speaker

I’m your host, David Linthicum, author, speaker, B-list geek, longtime AI systems architect and analyst at theCube Research. Let’s get into the topic. This one came about because I’ve gotten a lot of questions about the multimodal stuff as it relates to generative AI and other forms of AI in general. And it’s kind of interesting because, in looking at the use of this particular technology set, we may be able to solve problems that enterprises have without engaging larger generative AI systems or building LLMs, which are going to be very significant and complex to build and also very expensive to train.

[00:01:05]

And so, in some cases, multimodal AI will be just fine for the purposes that you need to use it for as it’s embedded in a business application.

What is Multimodal AI?

Let’s talk about what it is. So, multimodal AI and smaller language models are really a trend in the field of AI. And it’s really about the ability to accept formats other than text. It can take images, it can take video, it can take audio, and then understand and convert them and make sense of them within an AI model. And so, that’s very handy. So, it refers to a type of AI system that can process and integrate multiple types of data, hence multimodal, including text, images, audio, and video, like I just mentioned.

Benefits of Multimodal AI

And it allows you to better understand what that content is and generate responses to the content. So, this allows you to perform complex tasks that involve more than one data type simultaneously.

[00:02:04]

And so, chances are, if you’re using one of those, I think, annoying apps, where you’re taking a picture of yourself and it’s going into some sort of an AI system and then a cartoon version of you is spit out on the other end, that’s an example of multimodality.

Examples of Multimodal AI in Action

In other words, you’re sending an image into an LLM system or into a generative AI system, and it’s processing the image, it’s able to see who you are, and it’s able to create another format or another representation of you in your image. And then, also using a multimodality kind of interface, it spits out another image on the other side. So, whether you know it or not, chances are you’ve used multimodal models as part of leveraging AI applications. And if you’re leveraging ChatGPT, some of the latest releases of that technology are able to accept things like videos and images, which is handy because sometimes we’re going to talk to our generative AI models through chatbots,

[00:03:03]

but other times we want to submit an image and have it process the image. For example, with the capabilities within ChatGPT, you can sketch out what your interface should look like for a particular application, take a picture of it, submit it to ChatGPT, and tell it to generate the HTML to create a screen around this particular user interface you’re trying to create. It’ll do that for you. So, we’re able to take a handwritten drawing of how a user interface should look, take a photo of it, and submit it to ChatGPT, in this case using a multimodal system to absorb the image into the technology, and it’s able to spit out the HTML based on what it understands about the image.

Use Cases of Multimodal AI

So, there are lots of different use cases for this technology. In healthcare, there’s medical imaging: we deal with images like MRIs and radiology images, and you’re able to do diagnostic accuracy checks and support telemedicine. In education, there’s interactive learning, such as enhancing educational platforms by combining text, images, and videos to create interactive lessons.

[00:04:16]

In other words, we’re perhaps even reading the expressions of the student who is looking at the particular lesson that’s being run, to see how they’re responding to it and whether they’re under stress, so we can redo the lesson, slow things down, or adapt in some other way. So, those kinds of things are very, very handy. Customer service, chatbots, and virtual assistants: we obviously know how to use those to provide more accurate and context-aware responses. In other words, instead of just having a chatbot talk to a user via the keyboard, we’re able to, perhaps, even submit images of a problem that we’re having, such as a plumbing misconfiguration, and it’s able to diagnose the problem just via the image.

[00:05:05]

It’s an example of multimodal AI. Autonomous vehicles use them. Entertainment uses them. Security, threats, and surveillance. And if you have the kind of doorbell cam that’s able to recognize the image, it’ll tell you if it’s a familiar face. It’ll tell you if a package has been delivered. It will tell you what’s actually going on versus just showing you the video. That’s an example, or typically is going to be an example of multimodal AI.

Multimodal AI vs. Generative AI

So, the use cases really get into the versatility of this technology. And obviously, the common question is going to be, how does it relate to generative AI? The reality is they’re complementary, and you can use them by themselves, and you can use them together. Take the use case we just mentioned, where we’re drawing a diagram of the way a user interface should work and submitting that to a multimodal AI system. Obviously, it’s able to understand what it is, but it’s also able to pass that information off,

[00:06:02]

typically in changing the modalities, into a generative AI system that’s able to do more advanced processing, more advanced inferences on the particular image that you’ve put in there.

Multimodal AI for Enterprises

But the reason I’m mentioning it here is because in many cases, enterprises that are using AI technology or want to use AI technology may find that multimodal AI is just fine for the particular use case that they have without having to employ a generative AI system, without having to employ an LLM. You’ve got to remember, if we’re building LLMs, it’s very expensive if we’re using our own data, and very expensive if we’re using our own processors, which are very high-end systems. It takes a long time to train the generative AI models, and it can be overly expensive and overly complex based on the problems we’re looking to solve. And so, there are instances where enterprises are going to find that multimodal AI is going to operate just fine.

[00:07:02]

It does not require the same size of processors. It does not normally require GPUs, which are very expensive, as you know, and it’s able to process things faster, and it becomes a simpler form of AI that we can leverage for tactical use cases within particular business applications. The sky’s the limit. Consider the ability to take a picture of a farmer’s field, to figure out the hydration of that field based on images that are coming in that are read by multimodal AI, and to put that information directly into a watering plant, or directly into John Deere tractors that are able to go off and cultivate a field.

Advantages of Multimodal AI

So, the necessity of this is a couple of things. Number one, there’s a GPU shortage out there, at least there was, and they’re also very expensive. So, what we’re trying to do is get away with doing particular tactical use cases for AI and doing so on the cheap.

[00:08:01]

In other words, we’re not necessarily having to engage with a large language model. We’re able to use a small language model in this case and do a very tactical use case of AI technology, in this case, multimodal AI technology. Also, if we’re processing these LLMs out on the cloud, that has a tendency to also be very expensive. So, it’s able to optimize architectures where we’re just using what we need in terms of AI capabilities, and I think that’s something that is not necessarily as well known as it should be.

Multimodal AI vs. Large Language Models

In other words, people want to build these huge, honking, very expensive generative AI systems, which is fine and dandy. Such a system is able to consume information using multimodality, which is what we’re talking about here, and able to produce information using multimodality: images, videos, even singing a song. We’ve all gone to ChatGPT and had it write a song about us. That’s all well and good, but you’re going to pay for that.

[00:09:02]

So, there’s going to be complexity. There’s going to be engineers involved. There’s going to be an architect involved. There’s going to be a project that has to go on for many months, perhaps sometimes many years, to build some of these very advanced systems. And I’m finding that when I’m looking at the use cases of these things, and I’m talking to individual enterprises, that they’re not necessarily indicated to move into an LLM. In other words, they’re not solving very complex problems in and of themselves. And in many cases, multimodal AI is going to work just fine.

Key Message: Multimodal AI for Tactical Use Cases

And really, that’s kind of the core message that I’m trying to provide here. So, in other words, we have this type of AI that is going to be able to be used with generative AI systems or without them. And there are many tactical use cases that I’m seeing for AI, certainly when you’re dealing with multimedia things, images, data, which is going to be a use case for lots of businesses, whether it’s monitoring factory floor automation and productivity, and the list goes on in terms of the number of ways in which we can use this technology.

[00:10:09]

It may not require that we use generative AI. And by doing so, we’re solving the problem with the minimum viable solution, which is going to provide us with the most optimized way to solve that particular business problem that we’re looking to solve. And I think that’s really kind of what it’s all about.

Importance of Minimal AI Solutions

Leveraging AI for a business is not about having the goal of just building the most expensive and impressive system out there, and I’ve seen a lot of businesses that seem to be working in that direction. It’s about the ability to use a minimal amount of AI technology, in this case, the tactical use of AI, to solve particular business problems, which is what businesses are normally looking to solve. Most businesses are not going to build ChatGPT and other LLMs out there. It’s just not going to be practical for them.

[00:11:01]

It’s too expensive, very complex. You have to keep very expensive people around, data scientists, AI engineers. But the tactical use of AI, such as multimodal systems, is going to be a more pragmatic approach.

Tools for Building Multimodal AI Systems

So where do you find the tools to build these things? Well, the good news is it’s the same stuff that we’re using to build generative AI systems. So in other words, most of the tools out there that can assist you in building generative AI systems, and I understand there’s a big complex tool stack that comes along for the ride, can be used for creating multimodal AI systems as well. So we don’t have to retrain our developers on different tool sets, even though I’m sure there are tool sets that are approaching this problem specifically. Normally, if you understand how to work with and build generative AI models, LLMs, you’re also able to build small language models and leverage multimodal AI as a specific component of that particular tool set. So that’s the good news.

Takeaway: Focus on Business Requirements

So we’re leveraging the same tools.

[00:12:00]

We’re leveraging the same cloud services to build and deploy these systems. So what you need to take away from here is that utilization of this technology is not about building the most complex and expensive systems using AI. It’s about meeting the minimum business requirements using whatever technology is needed, in this case, providing tactical benefits that lead to strategic benefits. So in other words, we’re not trying to over-engineer something or make it overly complex. We’re just going to step up and use a particular type of AI technology, in this case, multimodal, to solve particular tactical business problems, which we’re going to find is going to happen a lot of the time.

Practical Applications of Multimodal AI

So the ability to leverage this technology for tactical use cases that are attached to business processes and the ability to deploy these things as bits and pieces of applications out there, the ability to also

[00:13:00]

automate things that are not automated today, the ability to read invoices, the ability to look at the productivity of a factory floor, the ability to look at the productivity of people in general, and the ability to look at images in different ways is going to be a pragmatic use case for this technology.

Conclusion: Pragmatic Approach to AI

And businesses need to keep that in mind. That’s kind of the core message here. Well, anyway, don’t forget to like and subscribe. Also, comment below. Let me know what you think, and let me know what you want us to cover here on theCube. It’s great being here. You guys be safe. Cheers.