Why build what you can buy?

Nov 17, 2025


Most startups avoid spending engineering cycles on internal tools. Meanwhile, we built an internal LLM Observability platform. Here’s why.

The Default

There is a tendency to buy rather than build. And a tendency for internal tools to be lower quality than customer-facing products.

At my last startup, we bought a full-featured customer support platform. We only used a small part of it. And, our internal tool for fixing issues was a bare-bones Django Admin panel.


Think of your own case. Aren’t most internal tools off-the-shelf? When you do build something, isn’t it rougher than what you ship to customers? The default, then, is to avoid building, and to under-invest when you do build.

A Justification

When Incident IO demoed their “Workbench” tool to AI tinkerers, it drew attention. It stood out because the reflex is to build only what is core and buy everything else. And because internal tools, when built, are rarely so polished.

When speed is survival, it feels hard to justify polish on something never seen by customers. It feels harder still to justify building and maintaining a tool when you could buy it. Every minute spent on an internal tool is a minute lost from the product that makes you stand out.

Buying, in contrast, concedes that your problem isn’t unique. Someone has already productised the solution. It offers immediacy where building denies it.

Tools as Leverage

The issue is that internal tools often support a core competency, and can be just as crucial to it.

At Briefcase, building reliable AI agents for accounting workflows is a core competency. The tooling is the infrastructure that lets us deliver industry-leading agents. It boosts the product's value and sets it apart.

Good tools make problems more visible, outcomes measurable, and trade-offs easier to weigh. It’s no surprise that Cruise has its own simulation platforms, or Datadog has its internal telemetry systems.

They also make us faster over time. Consider a GPS vs. a map. A map requires constant interpretation and manual correction. A GPS provides real-time feedback, quick error correction, and less cognitive load. Both work, but in my short experience orienteering, I found that the teams that didn’t get lost were using a GPS.

For Briefcase, industry-leading AI-agent tooling is non-negotiable.

Our Need for Purpose-Built LLM Observability Tooling

Probabilistic models make opaque, non-deterministic decisions that need constant monitoring. Here, any tooling misfit adds up.

We found that no commercial product fit us in every respect. They assumed agent SDKs, but we had built our own integrations.

It meant only the inputs and outputs of our LLM calls were tracked - token usage and cost were not. It meant requests with non-standard payloads, like base64, couldn’t be re-run within playgrounds. It meant model switching and comparison within playgrounds wasn’t possible, because we didn’t use an SDK that normalised requests. It all slowed us down where it mattered most.

The New Build–Buy Frontier

“Code is the new no-code” - Reuben

Despite the long-run payoff, the upfront cost of building has long been prohibitive. But the build–buy frontier has shifted with LLMs. Building is no longer a blocking cost; we can create custom solutions with speed and ease.

For example, we’ve recently made the shift from Retool to building our own dashboards. It’s easier to vibe-code a new ORM query than to write a complex SQL statement by hand in Retool.
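
To make this concrete, here is a minimal sketch of the kind of dashboard query we now write in code rather than in Retool. It assumes a Prisma-style ORM; the transaction model and its fields are illustrative, not our actual schema.

```typescript
// Low-confidence, unreviewed transactions for a review dashboard.
// Assumes a Prisma client; model and field names are illustrative.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

export async function lowConfidenceTransactions(limit = 50) {
  return prisma.transaction.findMany({
    where: {
      categoryConfidence: { lt: 0.8 }, // hypothetical confidence column
      reviewedAt: null,                // not yet reviewed by a human
    },
    orderBy: { createdAt: "desc" },
    take: limit,
  });
}
```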

It also allows us to achieve a close integration of product and tooling. We’ve linked accuracy metrics to admin views that highlight specific transactions. It takes one click to see those that our AI has got wrong.


Of course, we still need to be pragmatic. Before building, exhaust the obvious options first. And when building, constrain scope and complexity. That takes strong opinions and the reuse of existing infrastructure where possible.

Opinionated Systems Stay Simple

“Simple often means opinionated” - Jan

Opinions tell us what matters and what to ignore, simplifying requirements. For example, we chose not to support structured output in our LLM Playground. Our standard pattern starts with a chain-of-thought call. Then, a smaller model turns the unstructured output into JSON. Meaningful prompt engineering happens in the earlier stage, where structured output doesn’t apply.
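
As a rough sketch of that two-stage pattern, assuming the OpenAI Node SDK (model names, prompts, and the output shape are illustrative):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

export async function categoriseTransaction(description: string) {
  // Stage 1: unstructured chain-of-thought reasoning. This is where the
  // meaningful prompt engineering happens.
  const reasoning = await openai.chat.completions.create({
    model: "gpt-4o", // illustrative model choice
    messages: [
      { role: "system", content: "Think step by step about which expense category fits." },
      { role: "user", content: description },
    ],
  });
  const thought = reasoning.choices[0].message.content ?? "";

  // Stage 2: a smaller model only converts the free-form answer into JSON,
  // so the playground never needs structured-output support.
  const extraction = await openai.chat.completions.create({
    model: "gpt-4o-mini", // illustrative model choice
    messages: [
      { role: "system", content: 'Return JSON of the form {"category": "...", "confidence": 0.0}.' },
      { role: "user", content: thought },
    ],
    response_format: { type: "json_object" },
  });
  return JSON.parse(extraction.choices[0].message.content ?? "{}");
}
```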

We also chose to treat everything as a span, with parent-child relationships. We rebuilt hierarchies from flat data. This helped us avoid manual trace state management and made every step queryable. Our evaluation framework followed this pattern too. Each experiment turned into a span, with child spans for dataset examples and LLM calls.
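
A minimal sketch of that span model; the field names are illustrative rather than our exact schema.

```typescript
// Every unit of work is a flat span row with a parent pointer; the tree is
// rebuilt at read time, so workers never manage trace state themselves.
interface Span {
  id: string;
  parentId: string | null;
  name: string; // e.g. "experiment", "dataset_example", "llm_call"
  startedAt: number;
  endedAt: number;
  attributes: Record<string, unknown>;
}

interface SpanNode extends Span {
  children: SpanNode[];
}

export function buildTree(spans: Span[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>(
    spans.map((s): [string, SpanNode] => [s.id, { ...s, children: [] }])
  );
  const roots: SpanNode[] = [];
  for (const node of nodes.values()) {
    const parent = node.parentId ? nodes.get(node.parentId) : undefined;
    if (parent) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}
```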


Each LLM span also carried provider-specific pricing, and the platform rolled up the span tree to show a detailed breakdown of cost and latency at every level.
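
A sketch of how that roll-up can work, assuming each LLM span stores its own token counts and per-token prices (field names illustrative):

```typescript
interface LlmUsage {
  inputTokens: number;
  outputTokens: number;
  inputPricePerToken: number;  // provider-specific pricing stored on the span
  outputPricePerToken: number;
}

interface CostNode {
  startedAt: number;
  endedAt: number;
  usage?: LlmUsage; // present only on LLM spans
  children: CostNode[];
}

export function rollUp(node: CostNode): { costUsd: number; latencyMs: number } {
  // Cost of this span itself: only LLM spans carry usage and pricing.
  const own = node.usage
    ? node.usage.inputTokens * node.usage.inputPricePerToken +
      node.usage.outputTokens * node.usage.outputPricePerToken
    : 0;
  const children = node.children.map(rollUp);
  return {
    // Cost sums over the whole subtree.
    costUsd: own + children.reduce((sum, c) => sum + c.costUsd, 0),
    // A span's wall-clock duration already covers its children.
    latencyMs: node.endedAt - node.startedAt,
  };
}
```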


Buying first also helps clarify opinions. You buy a fixed feature set and learn what you can and can’t live without, and where it falls short. If you need to build, like we did here, there’s a clear scope to follow instead of a growing list of extras.

For example, we knew our LLM observability platform had to support re-running agent steps, with quick changes to the model inputs, or to the model itself, within re-runs. We used proxy-based LLM persistence to implement LLM-specific tracing by default, and we normalised provider-specific formats into a unified schema. That consistency enabled the playground to reuse traced results from various providers.
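
A simplified sketch of what such a proxy can look like, assuming the OpenAI Node SDK; the normalised schema and the webhook URL are illustrative:

```typescript
import OpenAI from "openai";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions";

const openai = new OpenAI();

// Provider-agnostic record that the playground can re-run and model-switch.
interface NormalisedLlmSpan {
  provider: "openai";
  model: string;
  messages: { role: string; content: string }[];
  output: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
}

// Hypothetical helper: ship the normalised record to our tracing webhook.
async function postSpan(span: NormalisedLlmSpan): Promise<void> {
  await fetch(process.env.TRACE_WEBHOOK_URL ?? "http://localhost:3000/webhooks/spans", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ spans: [span] }),
  });
}

// Agent code calls this wrapper instead of the SDK directly, so every LLM
// call is persisted in the same shape regardless of provider quirks.
export async function tracedChat(model: string, messages: ChatCompletionMessageParam[]) {
  const start = Date.now();
  const response = await openai.chat.completions.create({ model, messages });
  // Fire-and-forget: persistence must never block or fail the agent step.
  void postSpan({
    provider: "openai",
    model,
    messages: messages.map((m) => ({
      role: m.role,
      content: typeof m.content === "string" ? m.content : JSON.stringify(m.content ?? ""),
    })),
    output: response.choices[0].message.content ?? "",
    inputTokens: response.usage?.prompt_tokens ?? 0,
    outputTokens: response.usage?.completion_tokens ?? 0,
    latencyMs: Date.now() - start,
  }).catch(() => {});
  return response;
}
```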


Building on Existing Infrastructure

We added a single webhook endpoint to our existing API. Workers send span data to this webhook in a fire-and-forget pattern, and the API persists it to our SQL database through a dependency-injected store, so the handler works with any injected DB instance. That makes it easy to switch to another database later: if span volume ever outgrows SQL, we can adjust.
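
A sketch of that ingestion path. Express, the SpanStore interface, and the field names are assumptions for illustration, not necessarily our exact stack:

```typescript
import express from "express";

interface SpanRecord {
  id: string;
  parentId: string | null;
  name: string;
  startedAt: number;
  endedAt: number;
  attributes: Record<string, unknown>;
}

// The handler only depends on this interface: today it is backed by SQL,
// later it could be anything that implements insertMany.
interface SpanStore {
  insertMany(spans: SpanRecord[]): Promise<void>;
}

export function createSpanWebhook(store: SpanStore) {
  const router = express.Router();
  router.post("/webhooks/spans", express.json({ limit: "5mb" }), async (req, res) => {
    // Acknowledge immediately: workers fire and forget, and observability
    // must never slow down or fail an agent run.
    res.status(202).end();
    try {
      await store.insertMany(req.body.spans as SpanRecord[]);
    } catch (err) {
      console.error("failed to persist spans", err);
    }
  });
  return router;
}
```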


We also reused existing components and primitives, which made polish easy.


Iterating Toward a Perfect Fit

Building gives us the opportunity to iterate on concrete pain points.

We added support for uploading base64 and binary images to S3 for URL-based rendering. We also used Puppeteer to render HTML from LLM payloads.
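
Roughly, those helpers look like the sketch below, assuming the AWS SDK v3 and Puppeteer; the bucket name and key scheme are illustrative:

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import puppeteer from "puppeteer";
import { randomUUID } from "node:crypto";

const s3 = new S3Client({});

// Push a base64 image to S3 so the trace UI can render a plain URL.
export async function uploadBase64Image(base64: string, contentType: string) {
  const key = `trace-assets/${randomUUID()}`;
  await s3.send(
    new PutObjectCommand({
      Bucket: "llm-trace-assets", // illustrative bucket name
      Key: key,
      Body: Buffer.from(base64, "base64"),
      ContentType: contentType,
    })
  );
  return `https://llm-trace-assets.s3.amazonaws.com/${key}`;
}

// Render an HTML payload from an LLM call into a screenshot for the UI.
export async function renderHtmlPayload(html: string) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: "networkidle0" });
    return await page.screenshot({ fullPage: true });
  } finally {
    await browser.close();
  }
}
```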

We implemented custom experiment tags to keep track of evals as we make changes. When checking a change’s impact on performance, it’s key to know which eval run and scores correspond to which change.
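
A sketch of how an eval run can be recorded with those tags; the tag names, score shape, and webhook URL are illustrative:

```typescript
import { randomUUID } from "node:crypto";

// Wrap an eval run so its aggregate scores are stored as an "experiment"
// span tagged with whatever changed (e.g. git SHA, prompt version, model).
export async function recordExperiment(
  tags: Record<string, string>,
  runEvals: () => Promise<Record<string, number>>, // e.g. { accuracy: 0.93 }
) {
  const startedAt = Date.now();
  const scores = await runEvals();
  await fetch(process.env.TRACE_WEBHOOK_URL ?? "http://localhost:3000/webhooks/spans", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      spans: [
        {
          id: randomUUID(),
          parentId: null,
          name: "experiment",
          startedAt,
          endedAt: Date.now(),
          attributes: { tags, scores },
        },
      ],
    }),
  });
  return scores;
}
```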


We even integrated it within our product: every transaction view has a button that takes you straight to the agent traces for that transaction, making debugging immediate.

The Compounding Impact

Building internal tooling is like investing in infrastructure. The payoff isn’t instant, but it accelerates everything built on top of it.

For example, we’ve seen the tool improve our iteration speed. Context tuning and evaluation are now part of a tight feedback loop. Visualising and aggregating cost and latency has also surfaced inefficiencies and guided optimisation. We’ve used it recently to decrease agent LLM cost by 10-15% and latency by 35-45%.

We will continue to invest in internal tools that are better than what’s available on the market, both for long-term velocity and for strengthening our core competencies.

We are hiring!