
How will multimodal AI transform the accounting industry?
May 24, 2024

Multimodal is the new norm
Recent AI advancements have made headlines with the introduction of Gemini 1.5 and GPT-4o. These updates highlight a significant trend: multimodal capabilities are becoming the new standard in AI.
Gemini 1.5, Google’s flagship AI foundation model, was released in February this year with two standout features:
Multimodality across text and imagesA context window of over 1 million tokens
This allows the AI to complete some pretty impressive tasks, such as identifying a specific scene in a movie from a hand-drawn image. In the past two weeks, Google announced an update to Gemini 1.5 pro which will introduce audio as a new modality. Currently, audio needs to be transcribed into text (speech-to-text) for the AI to process it. However, when the new update goes public, Gemini will be able to process audio natively and concurrently alongside text and images.
GPT-4o was released a couple of weeks ago by OpenAI with significant improvements in multimodal capability. First, GPT-4o will soon be able to process audio in addition to images and text (much like Gemini) and will have more efficient image-processing capabilities to better understand and generate responses based on visual inputs. More importantly, the AI also has better contextual understanding when dealing with multimodal inputs. For example, it can integrate and correlate information from text, images and audio to provide more coherent and contextually accurate responses. Here’s one of my favourite examples — an AI maths tutor:
Oh, and the GPT-4o API is 50% cheaper than its predecessor, GPT-4. A cheaper API means cheaper AI products for everyone, which is welcome news.
What does this mean for accountants?
“Multimodal” refers to AI’s ability to handle multiple types of data simultaneously. Traditional AI models focus mainly on text, but real-world applications often need a mix of text and images (and sometimes even audio).
This is especially true in the accounting industry. Think about the variety of data formats that get inputted into ledger software: Images, PDFs, and CSVs, just to name a few. More often than not, accountants are faced with a combination of the above, such as an email with an image attachment and text body saying, “This invoice has been paid”. AI can now both analyse the image and read the accompanying text simultaneously in order to process the invoice appropriately.
We believe the capability of multimodal AI will facilitate a step-change in efficiency for accountants utilising AI-native products. For example, rather than using OCR technology to scan a receipt, leaving a bookkeeper to review and post it manually, AI will read, understand, and process the receipt from start to finish.
We are hiring!