
Building Autopilot

Nov 9, 2025

Probabilistic AI agents paired with deterministic formal verification create the perfect combination for high-trust domains like finance and accounting. At Briefcase, we run 12 AI agents in production, but to truly automate critical accounting workflows, we built Autopilot - our deterministic verification system that evaluates AI results in real time and learns from human corrections. After 6 weeks in production, here's how we built it, why determinism matters, and the impact we've seen.

Invoice processing AI agent

Briefcase automates the complete invoice capture and bookkeeping workflow:

  1. Input processing: Unstructured data (emails, images, PDFs) containing invoices/receipts

  2. Data extraction: Core transaction information

  3. Line item split: Individual item separation and processing

  4. Classification: Transaction categorisation

  5. Tax determination: VAT rate calculation per line item

[Image: Example trace of our cost transaction processing AI agent combining LLM and business logic chains]

The agent outputs fully processed transactions ready for the general ledger. Autopilot then determines whether to auto-publish or require human review based on three criteria:

  1. Document legibility

  2. Processing completeness

  3. Historical consistency

Document legibility

Uploaded documents can vary widely in quality, so checking legibility is essential for reliable downstream processing.

Vision transformers tokenise images into patches (e.g. 16×16 or 32×32 pixels), flatten each patch into a 1D vector, add positional embeddings for spatial awareness, stack the vectors into 2D matrices, and feed them through attention layers. To a transformer, an image is just a collection of vectors; it can't inherently distinguish a blurry document from a legible one.

To get around this, we run extraction multiple times. Identical results across runs give us high confidence that the document is legible.
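A minimal sketch of what this check can look like (the extraction function and invoice shape are illustrative stand-ins, not our actual agent interface):

// Minimal sketch: run extraction several times and require agreement.
// `ExtractedInvoice` and `extractInvoice` are hypothetical stand-ins.
type ExtractedInvoice = { supplier: string; total: number; currency: string }

export const isLegible = async (
  document: Uint8Array,
  extractInvoice: (doc: Uint8Array) => Promise<ExtractedInvoice>,
  runs = 3,
): Promise<boolean> => {
  const results = await Promise.all(
    Array.from({ length: runs }, () => extractInvoice(document)),
  )

  // Identical results across all runs indicate the document is legible
  // enough for reliable downstream processing.
  const serialised = results.map((result) => JSON.stringify(result))

  return new Set(serialised).size === 1
}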

Processing completeness

This straightforward guardrail verifies:

  • Is all necessary data extracted?

  • Is the transaction from an existing supplier?

  • Is the document type postable to the general ledger (invoice/receipt/credit note)?

All checks must pass for this guardrail to succeed.
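A rough sketch of such a guardrail, with an illustrative transaction shape and supplier lookup rather than our actual schema:

// Illustrative sketch; the transaction shape and supplier lookup are
// hypothetical, not our production schema.
const POSTABLE_DOCUMENT_TYPES = new Set(['invoice', 'receipt', 'credit_note'])

type CandidateTransaction = {
  documentType: string
  supplierId?: string
  total?: number
  currency?: string
  lineItems: { description: string; amount: number }[]
}

export const passesCompletenessGuardrail = (
  transaction: CandidateTransaction,
  knownSupplierIds: Set<string>,
): boolean => {
  // Is all necessary data extracted?
  const hasRequiredData =
    transaction.total !== undefined &&
    transaction.currency !== undefined &&
    transaction.lineItems.length > 0

  // Is the transaction from an existing supplier?
  const hasExistingSupplier =
    transaction.supplierId !== undefined &&
    knownSupplierIds.has(transaction.supplierId)

  // Is the document type postable to the general ledger?
  const isPostable = POSTABLE_DOCUMENT_TYPES.has(transaction.documentType)

  // All checks must pass for the guardrail to succeed.
  return hasRequiredData && hasExistingSupplier && isPostable
}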

Historical consistency

Now onto the fun stuff. Historical consistency enables pattern recognition and learns from corrections to determine auto-publishing safety. Our design constraints:

  1. Deterministic: Consistent results regardless of run count

  2. Generalisable: Serves diverse businesses (B2B SaaS, retail, hospitality)

  3. Business-isolated: Patterns from Business X never affect Business Y

  4. Self-improving: Adjusts confidence based on human corrections

  5. Recency-weighted: Recent transactions matter more than older ones

Why not LLM as a judge?

LLMs exhibit U-shaped confidence distributions rather than normal ones and perform poorly on out-of-distribution (adversarial) examples. Not to mention that it is quite costly. LLM-as-a-judge, as it stands today, has its place in data labelling, but less so in the critical evaluation step that determines whether a human should be in the loop.

Our approach: Bayesian statistical model

We evaluated two options:

  1. Simple statistical model

  2. Individual ML models per business

Option 2's complexity made it unsuitable for rapid iteration and baseline establishment. We chose a Bayesian approach because it is a much simpler starting point and can be just as effective.

The algorithm:

  1. Retrieve semantically similar published transactions

  2. Calculate evidence

  3. Calculate prior

  4. Perform Bayesian update

  5. Make autopilot decision
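Read as a single pipeline, the flow looks roughly like the sketch below; the step functions are hypothetical stand-ins for the implementations covered in the next sections.

// Illustrative orchestration of the five steps; the step functions are
// hypothetical stand-ins for the implementations described below.
type AutopilotSteps = {
  retrieveSimilarTransactions: (transactionId: string, businessId: string) => Promise<unknown[]>
  calculateEvidence: (transactionId: string, similar: unknown[]) => number
  calculatePrior: (businessId: string) => Promise<number>
  calculatePosteriorProbability: (args: { prior: number; evidence: number }) => number
  autopilotThreshold: number
}

export const shouldAutoPublish = async (
  transactionId: string,
  businessId: string,
  steps: AutopilotSteps,
): Promise<boolean> => {
  // 1. Retrieve semantically similar published transactions (business-isolated)
  const similar = await steps.retrieveSimilarTransactions(transactionId, businessId)

  // 2. Evidence: how strongly the new transaction matches historical patterns
  const evidence = steps.calculateEvidence(transactionId, similar)

  // 3. Prior: baseline confidence adjusted by past human corrections
  const prior = await steps.calculatePrior(businessId)

  // 4. Bayesian update combining prior and evidence
  const posterior = steps.calculatePosteriorProbability({ prior, evidence })

  // 5. Auto-publish only above the tuned threshold
  return posterior >= steps.autopilotThreshold
}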

Implementation details

Similar transactions retrieval

We use RAG with reranking, maintaining isolated namespaces per business to ensure pattern learning remains business-specific.

A more detailed overview of our RAG approach will be covered in future posts and is out of scope for this one.

Evidence calculation

Field weights reflect relative importance:

export const FIELD_WEIGHTS = {
  supplier: 0.2,
  amountRange: 0.05,
  lineItemCount: 0.15,
  category: 0.25,
  taxRate: 0.25,
  quantity: 0.05,
  description: 0.05,
}
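The weights above sum to 1.0, so if each per-field match score lands in [0, 1], a simple weighted sum yields an evidence value in [0, 1] as well. As an illustrative sketch (the exact combination in production may differ):

// Weighted sum of per-field match scores, each assumed to be in [0, 1].
export const calculateEvidence = (
  fieldScores: Record<keyof typeof FIELD_WEIGHTS, number>,
): number =>
  (Object.keys(FIELD_WEIGHTS) as (keyof typeof FIELD_WEIGHTS)[]).reduce(
    (evidence, field) => evidence + FIELD_WEIGHTS[field] * fieldScores[field],
    0,
  )

// Example: perfect supplier/lineItemCount/category/taxRate matches, partial
// amountRange and description matches, no quantity match:
// 0.2*1 + 0.05*0.5 + 0.15*1 + 0.25*1 + 0.25*1 + 0.05*0 + 0.05*0.5 = 0.9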

Recency decay ensures recent patterns take precedence:

// transactions within the grace period keep full weight
const decay = 0.5 ** (Math.max(0, months - gracePeriod) / halfLife)

With an example grace period of 3 months and half-life of 12 months:

  • 3-month-old transaction (within the grace period): weight = 1.0

  • 15-month-old transaction (one half-life past the grace period): weight = 0.5

  • 27-month-old transaction (two half-lives past the grace period): weight = 0.25

We normalise weights using softmax to create a proper distribution:

export const softmax = ({
  weights,
  gamma,
}: {
  weights: number[]
  gamma: number
}): number[] => {
  const exponents = weights.map((weight) => Math.exp(gamma * weight))
  const sum = exponents.reduce((a, b) => a + b, 0)
  
  return exponents.map((exponent) => exponent / (sum || 1))
}
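For example, applying the softmax to the decayed weights from the example above, with an assumed gamma of 2, concentrates most of the mass on the most recent transactions:

// Usage: decayed weights 1.0, 0.5 and 0.25 with a hypothetical gamma of 2.
// Higher gamma concentrates more mass on the most recent transactions.
const normalised = softmax({ weights: [1.0, 0.5, 0.25], gamma: 2 })
// ≈ [0.63, 0.23, 0.14]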

Field matching uses tiered scoring:

// supplier match calculation (tiered scoring)
const supplierScore = (matchRate: number): number => {
  if (matchRate >= 0.85) return 1.0
  if (matchRate >= 0.7) return 0.8
  if (matchRate >= 0.55) return 0.4
  return 0
}

Our internal debugging tool visualises individual decisions.


Prior calculation

Constants control learning behaviour:

export const BASE_PRIOR = 0.7
export const MIN_DAYS_SINCE_AUTO_PUBLISH = 7
export const POSITIVE_ADJUSTMENT_FACTOR = 0.03 // No correction → increase confidence
export const NEGATIVE_ADJUSTMENT_FACTOR = 0.05 // Correction → decrease confidence

Human corrections directly influence future decisions:

let adjustedPrior = basePrior

for (const transaction of autoPublishedTransactions) {
  // recencyWeight is the transaction's decay weight from the formula above
  const adjustment = transaction.humanCorrected
    ? -negativeAdjustmentFactor * recencyWeight
    : positiveAdjustmentFactor * recencyWeight

  adjustedPrior += adjustment
}

After these adjustments, we clamp the result to the [0, 1] range to get the new prior:

adjustedPrior = Math.max(0, Math.min(1, adjustedPrior))

Bayesian probability calculation

export const calculatePosteriorProbability = ({
  prior,
  evidence,
  epsilon = 1e-6,
}: {
  prior: number
  evidence: number
  epsilon?: number
}): number => {
  const p = Math.min(1 - epsilon, Math.max(epsilon, prior))
  const e = Math.min(1 - epsilon, Math.max(epsilon, evidence))
  const numerator = e * p
  const denominator = numerator + (1 - e) * (1 - p)

  return numerator / denominator
}

The epsilon prevents division by zero while maintaining numerical stability.
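To get a feel for how the update behaves, here are a few illustrative values (not production numbers):

// Illustrative values only.
calculatePosteriorProbability({ prior: 0.7, evidence: 0.9 }) // ≈ 0.95: strong pattern match boosts confidence
calculatePosteriorProbability({ prior: 0.7, evidence: 0.5 }) // = 0.70: neutral evidence leaves the prior unchanged
calculatePosteriorProbability({ prior: 0.7, evidence: 0.2 }) // ≈ 0.37: weak match pulls confidence down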


Autopilot decision

Now that we have the posterior, we need to establish an autopilot threshold. This should be tuned empirically; in the example below it is 0.8. Converging to a good threshold requires plenty of tuning and analysis, but the usual rule of thumb is to start conservative and relax it as you learn how the system actually performs in production.
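As a minimal sketch, with an assumed threshold constant of 0.8:

// Illustrative constant and helper; tune the threshold empirically.
export const AUTOPILOT_THRESHOLD = 0.8

export const makeAutopilotDecision = (
  posterior: number,
): 'auto_publish' | 'human_review' =>
  posterior >= AUTOPILOT_THRESHOLD ? 'auto_publish' : 'human_review'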


Result

Using Briefcase for our own bookkeeping, Autopilot reduced manual effort by over 80%, leaving only edge cases for review.

[Screenshot: magic wand icons indicate auto-published transactions]

[Image: every decision includes full explainability to build user trust]

Key takeaways

Deterministic heuristics complement probabilistic AI agents exceptionally well, enabling true end-to-end automation in high-trust domains. Our approach successfully:

✅ Maintains determinism through Bayesian statistics

✅ Generalises across business types

✅ Isolates learning per business

✅ Improves through human feedback

✅ Weights recent data appropriately

We're not stopping here: in 2026 we'll ship major improvements (including completely new paradigms) to our AI agents and formal verification mechanisms. Join us to build a robust, scalable, AI-native platform that transforms an entire industry.

We are hiring!