Multimodality

Introduction

Multimodality is the use of multiple modes of communication, such as text, images, audio, and video, to convey information more effectively and engagingly.

Multimodality in AI DIAL

AI DIAL taps into this by connecting to Large Language Models (LLMs) that handle various media types. You can create applications for specific modality tasks, or comprehensive solutions (orchestrators) that combine applications to cover more complex scenarios.

Working with different types of media is enabled through file support. In AI DIAL, users and applications can input and output files, which are stored in a dedicated bucket and accessed under a flexible permissions model. Files can be provided as input to multimodal models and generated by them as output.
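For illustration, the sketch below (Python, using `requests`) shows how a file might be uploaded to the per-key bucket and made addressable for later use as a model attachment. The endpoint paths, header name, API key, and file name follow the public DIAL Core examples and are assumptions that may differ in your deployment.

```python
# A minimal sketch of uploading a file to the DIAL file storage.
# DIAL_URL and the API key are placeholders for your own setup.
import requests

DIAL_URL = "http://localhost:8080"     # assumption: local DIAL Core endpoint
HEADERS = {"Api-Key": "dial_api_key"}  # assumption: API key with file permissions

# Resolve the bucket allocated to this API key.
bucket = requests.get(f"{DIAL_URL}/v1/bucket", headers=HEADERS).json()["bucket"]

# Upload a local image; it becomes addressable as files/{bucket}/images/cat.png
# and can later be attached to requests sent to multimodal models.
with open("cat.png", "rb") as f:
    requests.put(
        f"{DIAL_URL}/v1/files/{bucket}/images/cat.png",
        headers=HEADERS,
        files={"file": ("cat.png", f, "image/png")},
    )
```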

Models

The AI DIAL Chat application offers a user interface for communicating with the Supported Models.

Connections to LLMs are established through so-called adapters. Refer to the OpenAI, Bedrock, and Vertex adapters to learn more about them and the models they support. You can use the DIAL SDK to create custom model adapters.

AI DIAL has adapters for a variety of text-to-text LLMs. Refer to Supported Models to view the full list.

Regarding working with images:

  • For image-to-text tasks, AI DIAL has adapters for GPT-4 Vision, Claude 3, and Gemini Pro Vision models.
  • For text-to-image tasks, AI DIAL has adapters for DALL-E-3, Google Imagen, and Stability diffusion models.

For audio/video-to-text tasks, AI DIAL has adapters for Gemini 1.5 Pro, Gemini 1.5 Flash, and Gemini 1.0 Pro Vision. Refer to the Vertex Adapter to view all supported models.
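As a sketch of how such a model might be called through DIAL's unified chat completions API, the request below attaches a previously uploaded image to a user message. The deployment name, `api-version` parameter, and file URL are placeholders rather than prescribed values.

```python
# Illustrative image-to-text request via DIAL's chat completions API.
import requests

DIAL_URL = "http://localhost:8080"      # assumption: local DIAL Core endpoint
HEADERS = {"Api-Key": "dial_api_key"}   # assumption: API key
DEPLOYMENT = "gpt-4-vision-preview"     # placeholder deployment name

payload = {
    "messages": [{
        "role": "user",
        "content": "Describe this picture.",
        # DIAL-specific extension: file attachments travel in custom_content.
        "custom_content": {
            "attachments": [
                {"type": "image/png", "url": "files/<bucket>/images/cat.png"}
            ]
        },
    }],
    "max_tokens": 300,
}

resp = requests.post(
    f"{DIAL_URL}/openai/deployments/{DEPLOYMENT}/chat/completions",
    headers=HEADERS,
    params={"api-version": "2024-02-01"},  # assumption: Azure-style versioning
    json=payload,
)
print(resp.json()["choices"][0]["message"]["content"])
```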

Applications

You can use the DIAL SDK to create custom applications: any custom logic with a conversational interface, packaged as a ready-to-use solution. Refer to Tutorials to learn how to create a simple application, or watch a demo video.

Such applications can be designed and configured to use multimodal LLMs to perform specific tasks, or even form an ecosystem of applications that interact with each other.
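A minimal sketch of such an application, based on the DIAL SDK's echo example, is shown below. It simply returns the last user message; the class and method names are those of the `aidial-sdk` package and may change between versions.

```python
# A minimal DIAL application built with the DIAL SDK (aidial-sdk).
# It echoes the last user message; a real application would call
# multimodal models or other applications at this point.
from aidial_sdk import DIALApp
from aidial_sdk.chat_completion import ChatCompletion, Request, Response

class EchoApplication(ChatCompletion):
    async def chat_completion(self, request: Request, response: Response) -> None:
        last_user_message = request.messages[-1]
        with response.create_single_choice() as choice:
            choice.append_content(last_user_message.content or "")

# Register the application under the "echo" deployment name.
app = DIALApp().add_chat_completion("echo", EchoApplication())

# Run with, for example: uvicorn echo:app --port 5000
```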

Refer to the Cookbook section for several examples.

Orchestrator

Besides creating an application that solves a specific multimodal task, you can create a generic application that is aware of the multimodal DIAL models and can use them as tools to solve a given task. We call such generic applications orchestrators.

DIAL ChatHub is an example of an orchestrator that combines several applications and models into one unified access point. ChatHub can automatically route prompts to one of several agents (text-to-text, text-to-image, and vision-to-text applications) depending on the task at hand. For example, if a user asks about the weather, the Web RAG agent is engaged; if a user wants to generate an image from text input, a dedicated application connected to the corresponding model handles the task. All of this happens while the user interacts with a single ChatHub solution.
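ChatHub's internals are not shown here; purely as an illustration of the routing idea, the sketch below classifies the user's intent with a text model and forwards the request to a deployment registered for that kind of agent. The agent registry and all deployment names are hypothetical.

```python
# Illustrative only: a naive orchestrator skeleton, not ChatHub's implementation.
import requests

DIAL_URL = "http://localhost:8080"     # assumption: local DIAL Core endpoint
HEADERS = {"Api-Key": "dial_api_key"}  # assumption: API key

# Hypothetical registry of agents: intent -> DIAL deployment name.
AGENTS = {
    "web_rag": "web-rag-app",                  # e.g. questions about the weather
    "text_to_image": "dall-e-3",               # generate an image from text
    "vision_to_text": "gpt-4-vision-preview",  # describe an attached image
    "default": "gpt-4",
}

def complete(deployment: str, messages: list) -> dict:
    """Call a DIAL deployment through the unified chat completions API."""
    resp = requests.post(
        f"{DIAL_URL}/openai/deployments/{deployment}/chat/completions",
        headers=HEADERS,
        json={"messages": messages},
    )
    return resp.json()

def classify_intent(user_prompt: str) -> str:
    """Ask a text model which agent should handle the prompt."""
    result = complete("gpt-4", [{
        "role": "user",
        "content": "Answer with exactly one of web_rag, text_to_image, "
                   f"vision_to_text or default. Route this request: {user_prompt}",
    }])
    answer = result["choices"][0]["message"]["content"].strip()
    return answer if answer in AGENTS else "default"

def orchestrate(user_prompt: str) -> dict:
    """Route the prompt to the agent registered for the detected intent."""
    deployment = AGENTS[classify_intent(user_prompt)]
    return complete(deployment, [{"role": "user", "content": user_prompt}])
```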