Building a vertical AI agent
Apr 29, 2025

Building vertical AI agents varies by domain, but in general they all have similar toolkits at their disposal. In this post we will discuss how we build AI agents at Briefcase. Briefcase automates tasks in the accounting industry, where predictability, consistency and accuracy really matter.
A common workflow for users of Briefcase is to upload their invoices into the system. Briefcase will then perform a bookkeeping workflow on each invoice, which consists of multiple reasoning tasks such as extraction, line item split, categorisation and VAT rate assignment. Here's a simplified sketch of our AI agent:
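The step prompts and the call_llm helper below are illustrative placeholders rather than our production code:

```python
# Illustrative sketch of the per-invoice bookkeeping workflow; the prompts and
# the call_llm helper are placeholders, not our production code.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a real LLM client."""
    raise NotImplementedError

def run_bookkeeping_workflow(invoice_text: str) -> dict:
    # Each step is a separate reasoning task with its own prompt and evals.
    fields = call_llm(f"Extract the supplier, date and totals from:\n{invoice_text}")
    line_items = call_llm(f"Split this invoice into line items:\n{fields}")
    categories = call_llm(f"Assign a category to each line item:\n{line_items}")
    vat_rates = call_llm(f"Assign a VAT rate to each line item:\n{line_items}")
    return {
        "fields": fields,
        "line_items": line_items,
        "categories": categories,
        "vat_rates": vat_rates,
    }
```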

Let’s cover a few of these sections.
Chain of Thought
Chain of thought (CoT) is a very popular technique that has the LLM first reason about the task and only then perform it. Generally speaking, this improves the quality of the output and reduces hallucinations.
However, it doesn’t come for free and has its own pros & cons:
Latency & Cost: CoT typically increases task latency by seconds or even tens of seconds, and cost increases proportionally
Consistency & Predictability: CoT significantly helps the LLM solve a task with a degree of predictability that is otherwise usually achieved with few-shot examples
Generalisation & Accuracy: We’ve seen CoT significantly improve performance on out-of-distribution examples for which we haven’t optimised prompts/evals. However, we’ve also seen CoT degrade overall performance where tasks are very niche and prompt steps are too ambiguous
Faithfulness: While LLMs tend to ground their final answer in their reasoning, that’s not always the case, so evals for CoT tasks should include a faithfulness metric (a minimal check is sketched below)
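A faithfulness metric can start as simply as checking whether the final answer actually appears in the reasoning. This is a deliberately crude sketch; richer metrics (e.g. an LLM judge) build on the same idea:

```python
# A crude faithfulness check: does the final answer appear in the chain of
# thought at all? Even this catches answers that ignore the reasoning entirely.

def is_faithful(reasoning: str, final_answer: str) -> bool:
    return final_answer.strip().lower() in reasoning.lower()

def faithfulness_rate(examples: list[tuple[str, str]]) -> float:
    """Fraction of (reasoning, final_answer) pairs where the answer is grounded."""
    return sum(is_faithful(r, a) for r, a in examples) / len(examples)
```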
When we apply CoT
The task is sufficiently difficult that the LLM hallucinates
The input data to the LLM is highly unstructured
We have enough latency budget for the given task
Prompt structure
We found that this prompt structure works quite well for CoT prompts:
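Roughly, it looks like the template below; the section names are illustrative rather than a verbatim copy of our prompts:

```python
# Roughly the shape of a CoT prompt; section names are illustrative.
COT_PROMPT_TEMPLATE = """\
## Task
{task_description}

## Rules
{unambiguous_rules}

## Examples
{few_shot_examples}

## Input
{input_data}

First reason step by step inside <reasoning> tags,
then give the final answer inside <answer> tags.
"""
```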
Less is more
Prompt hygiene naturally declines over time as we keep iterating, adding one sentence here and another there to cover new scenarios. The overall coherence of the prompt then decreases and the LLM may start to hallucinate more as a result.
Therefore the way we think about prompting is the same as for everything else in engineering - simplify, simplify, simplify.
The best prompt is no prompt
Try to reduce the prompt before expanding it
Try to explain the problem you are solving as you would to a five-year-old; if you can’t describe the task in simple enough terms and you leave ambiguity in, the LLM will inevitably hallucinate
Keep prompts and examples unambiguous; avoid words like could/may/might
Keep system prompts generic, and few-shot examples / user prompts specific to the individual customer or use case (see the sketch after this list)
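For instance, the generic/specific split can look like this; the system prompt and field names are illustrative, and the message format is the standard chat-completion shape:

```python
# Sketch of the generic/specific split: a generic system prompt shared by all
# customers, with few-shot examples pulled per customer.

SYSTEM_PROMPT = "You categorise business transactions into the given chart of accounts."

def build_messages(customer_examples: list[dict], transaction: str) -> list[dict]:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for ex in customer_examples:  # customer-specific few shots
        messages.append({"role": "user", "content": ex["transaction"]})
        messages.append({"role": "assistant", "content": ex["category"]})
    messages.append({"role": "user", "content": transaction})
    return messages
```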
RAG
Vertical-AI tasks are highly contextual. To automate them, we have to inject task-specific context into each prompt. That’s particularly hard in fields like finance or law, because semantic similarity inside a domain differs from general-purpose similarity in everyday language. Most embedding models are trained for broad use and even vertical models are usually built on data that is still too generic to serve the needs of a single, specialised task.
Embedding space
In accounting there is a task called transaction categorisation. The list of categories (or chart of accounts) is bespoke to each individual business and can evolve over time. Here is an example categorisation that can apply to a B2B SaaS business:
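Google Workspace → Subscriptions
AWS → Cost of Goods Sold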

The company’s purchase of Google Workspace is recorded under Subscriptions, whereas the AWS spend is categorised as Cost of Goods Sold. AWS costs rise with the traffic the SaaS product serves, while Gmail licences scale with headcount rather than incoming traffic.
Injecting the right context
In order for the LLM to categorise transactions with a high degree of consistency and accuracy, it needs examples of previously categorised transactions that are genuinely semantically similar. While a human accountant would be able to semantically distinguish between Google and AWS transactions, even the best available embedding models would struggle to tell them apart.
To address this we apply a few techniques:
First, we spike the strengths and weaknesses of the best finance-aware embedding models and establish baseline thresholds for false positives and negatives
Experiment with the definition of the embedded entity/chunk. We iteratively redefine what counts as a chunk, removing noisy attributes and adding steering metadata to maximise similarity with the right examples and minimise similarity with the wrong ones.
Recalibrate the thresholds with the aim of minimising false positives
On retrieval, only consider chunks that score above the recalibrated threshold (see the sketch after this list)
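Put together, the retrieval step looks roughly like this; the cosine similarity helper and data layout are simplified assumptions, not our actual pipeline:

```python
import numpy as np

# Sketch of threshold-gated retrieval: query_vec and chunk vectors come from
# whichever embedding model survived the spike, and `threshold` comes from
# the recalibration step.

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray,
             chunks: list[tuple[str, np.ndarray]],
             threshold: float,
             k: int = 5) -> list[str]:
    """Return up to k chunks whose similarity clears the calibrated threshold.

    Returning nothing beats injecting a misleading example, which is why the
    threshold is tuned to minimise false positives.
    """
    scored = [(cosine_sim(query_vec, vec), text) for text, vec in chunks]
    kept = [(score, text) for score, text in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[:k]]
```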
Experimenting with chunks is similar to prompt engineering - it’s a lot of trial and error. And similar to prompt engineering, we want to be able to complete the task well for data outside the experimentation/evals dataset, meaning that we need to keep context within chunks generic enough to not overfit.
Even though these techniques let us get the most out of the available embedding models, for vertical AI applications they often hit their limits. Therefore we will be developing in-house solutions optimised for the financial tasks we are automating.
—
We will be covering other parts of our agents in future posts. If you’ve come this far and this sounds interesting to you, we are hiring Founding Product Engineers; please apply to jan@briefcase.so.