Building autopilot
Nov 9, 2025

Probabilistic AI agents paired with deterministic formal verification are a powerful combination for high-trust domains like finance and accounting. At Briefcase, we run 12 AI agents in production, but to truly automate critical accounting workflows we built Autopilot: our deterministic verification system that evaluates AI results in real time and learns from human corrections. After six weeks in production, here's how we built it, why determinism matters, and the impact we've seen.
Invoice processing AI agent
Briefcase automates the complete invoice capture and bookkeeping workflow:
Input processing: Unstructured data (emails, images, PDFs) containing invoices/receipts
Data extraction: Core transaction information
Line item split: Individual item separation and processing
Classification: Transaction categorisation
Tax determination: VAT rate calculation per line item

Example trace of our cost transaction processing AI agent combining LLM and business logic chains
The agent outputs fully processed transactions ready for the general ledger. Autopilot then determines whether to auto-publish or require human review based on three criteria:
Document legibility
Processing completeness
Historical consistency
Document legibility
Uploaded documents vary widely in quality, so checking legibility is essential to ensure reliable downstream processing.
Vision transformers tokenize images into patches (16×16 or 32×32 pixels), flatten them into 1D vectors, add positional embeddings for spatial awareness, stack them into 2D matrices, and feed them through attention mechanisms. To a transformer, an image is just a collection of vectors, so it can't inherently distinguish a blurry document from a legible one.
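For intuition, here is a minimal sketch (NumPy, with an illustrative 16×16 patch size) of how an image becomes a sequence of flattened patch vectors before positional embeddings and attention are applied:

```python
import numpy as np

def image_to_patch_vectors(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches and flatten each
    patch into a 1D vector, giving a (num_patches, patch*patch*C) matrix."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    # reshape into a grid of patches, then flatten each patch into one row
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

# a 224x224 RGB page becomes a 196 x 768 matrix of patch vectors
vectors = image_to_patch_vectors(np.zeros((224, 224, 3)), patch=16)
```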
To get around this, we run extraction multiple times; identical results across runs give us high confidence that the document is legible.
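A simplified sketch of the idea, with illustrative field names and run count rather than our production values:

```python
KEY_FIELDS = ["supplier", "total", "currency", "invoice_date"]  # illustrative

def legibility_confidence(document: bytes, extract, runs: int = 3) -> float:
    """Run the extraction agent `runs` times and measure agreement on key fields.
    Returns the fraction of fields on which every run produced the same value."""
    results = [extract(document) for _ in range(runs)]
    agreeing = 0
    for field in KEY_FIELDS:
        values = [r.get(field) for r in results]
        if len(set(values)) == 1:  # all runs agree on this field
            agreeing += 1
    return agreeing / len(KEY_FIELDS)

# e.g. require full agreement before treating the document as legible:
# is_legible = legibility_confidence(doc, extract_fn) == 1.0
```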
Processing completeness
This straightforward guardrail verifies:
Is all necessary data extracted?
Is the transaction from an existing supplier?
Is the document type postable to the general ledger (invoice/receipt/credit note)?
All checks must pass for this guardrail to succeed.
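A minimal sketch of this guardrail; the field names and the `known_suppliers` lookup are hypothetical:

```python
POSTABLE_TYPES = {"invoice", "receipt", "credit_note"}  # postable to the ledger
REQUIRED_FIELDS = ["supplier", "total", "currency", "invoice_date", "line_items"]

def completeness_check(txn: dict, known_suppliers: set[str]) -> bool:
    """All three checks must pass for the guardrail to succeed."""
    has_all_fields = all(txn.get(f) not in (None, "", []) for f in REQUIRED_FIELDS)
    existing_supplier = txn.get("supplier") in known_suppliers
    postable = txn.get("document_type") in POSTABLE_TYPES
    return has_all_fields and existing_supplier and postable
```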
Historical consistency
Now onto the fun stuff. Historical consistency recognises patterns in past transactions and learns from corrections to decide when auto-publishing is safe. Our design constraints:
Deterministic: Consistent results regardless of run count
Generalisable: Serves diverse businesses (B2B SaaS, retail, hospitality)
Business-isolated: Patterns from Business X never affect Business Y
Self-improving: Adjusts confidence based on human corrections
Recency-weighted: Recent transactions matter more than older ones
Why not LLM as a judge?
LLMs exhibit U-shaped confidence distributions rather than normal distributions and perform poorly on out-of-distribution (adversarial) examples. They are also costly to run at scale. LLM-as-a-judge, as it stands today, has its place in data labelling, but less so in a critical evaluation step that decides whether a human should be in the loop.
Our approach: Bayesian statistical model
We evaluated two options:
Simple statistical model
Individual ML models per business
Option 2's complexity made it unsuitable for rapid iteration and baseline establishment. We chose a Bayesian approach because it's a much simpler starting point and can be just as effective.
The algorithm:
Retrieve semantically similar published transactions
Calculate evidence
Calculate prior
Perform Bayesian update
Make autopilot decision
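Sketched end to end, with placeholder function names for the steps above (each is covered in the next section):

```python
def autopilot(txn: dict, business_id: str, threshold: float = 0.8) -> str:
    """High-level flow of the historical-consistency check (illustrative only)."""
    similar = retrieve_similar_transactions(txn, namespace=business_id)   # step 1
    evidence = calculate_evidence(txn, similar)                           # step 2
    prior = calculate_prior(business_id)                                  # step 3
    posterior = bayesian_update(prior, evidence)                          # step 4
    return "auto_publish" if posterior >= threshold else "human_review"   # step 5
```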
Implementation details
Similar transactions retrieval
We use RAG with reranking, maintaining isolated namespaces per business to ensure pattern learning remains business-specific.
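Conceptually, retrieval is always scoped to the business's own namespace and then reranked; in this sketch, `vector_store` and `rerank` are stand-ins for whichever embedding store and reranker you use:

```python
def retrieve_similar_transactions(txn, namespace, vector_store, rerank,
                                  k: int = 20, top_n: int = 5):
    """Retrieve semantically similar *published* transactions for one business only.
    The per-business namespace keeps pattern learning isolated."""
    query = f"{txn['supplier']} {txn['description']}"
    candidates = vector_store.search(query, namespace=namespace, top_k=k)
    return rerank(query, candidates)[:top_n]
```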
A more detailed overview of our RAG approach will be covered in future posts; it's out of scope for this one.
Evidence calculation
Field weights reflect relative importance:
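As an illustration (these exact fields and numbers are hypothetical, not our production weights):

```python
# illustrative field weights -- more important fields contribute more evidence
FIELD_WEIGHTS = {
    "category":       0.35,
    "tax_rate":       0.30,
    "supplier":       0.20,
    "description":    0.10,
    "payment_method": 0.05,
}
assert abs(sum(FIELD_WEIGHTS.values()) - 1.0) < 1e-9
```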
Recency decay ensures recent patterns take precedence:
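One plausible implementation, consistent with the worked example below: full weight inside a grace period, then exponential decay governed by a half-life:

```python
def recency_weight(age_months: float, grace: float = 3, half_life: float = 12) -> float:
    """Full weight within the grace period, then halve the weight for every
    `half_life` months of age."""
    if age_months <= grace:
        return 1.0
    return 0.5 ** (age_months / half_life)
```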
With example grace period = 3 months and half-life = 12 months:
3-month-old transaction: weight = 1.0
12-month-old transaction: weight = 0.5
24-month-old transaction: weight = 0.25
We normalise weights using softmax to create a proper distribution:
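A standard, numerically stable softmax over the recency weights:

```python
import numpy as np

def normalise_weights(weights: np.ndarray) -> np.ndarray:
    """Softmax: shift by the max for numerical stability, so weights sum to 1."""
    z = weights - weights.max()
    exp = np.exp(z)
    return exp / exp.sum()
```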
Field matching uses tiered scoring:
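The tiers below are illustrative rather than our exact production values, but they show the shape of the scoring and how it combines with the field weights (reusing the illustrative `FIELD_WEIGHTS` above) and the normalised recency weights:

```python
def field_match_score(candidate: str | None, historical: str | None) -> float:
    """Tiered scoring for a single field (tier values are illustrative)."""
    if candidate is None or historical is None:
        return 0.0
    if candidate == historical:
        return 1.0                                       # exact match
    if candidate.lower().strip() == historical.lower().strip():
        return 0.8                                       # match up to casing/whitespace
    return 0.0                                           # mismatch

def evidence_score(txn: dict, similar: list[dict], weights_norm) -> float:
    """Weighted evidence: per-field agreement scaled by field importance,
    weighted by each similar transaction's normalised recency weight."""
    return sum(
        w_recency * sum(
            w_field * field_match_score(txn.get(f), hist.get(f))
            for f, w_field in FIELD_WEIGHTS.items()
        )
        for hist, w_recency in zip(similar, weights_norm)
    )
```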
Our internal debugging tool visualises individual decisions:

Prior calculation
Constants control learning behaviour:
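As an illustration (these names and values are hypothetical, not our production constants):

```python
# illustrative constants governing how the prior moves with feedback
BASE_PRIOR = 0.5          # starting belief before any history exists
CORRECTION_PENALTY = 0.1  # how much a human correction lowers the prior
APPROVAL_BOOST = 0.02     # how much an untouched auto-publish raises it
PRIOR_FLOOR, PRIOR_CEILING = 0.05, 0.95  # keep the prior away from 0 and 1
```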
Human corrections directly influence future decisions; based on these adjustments, we calculate the new prior:
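A sketch of the update, using the hypothetical constants above:

```python
def updated_prior(prior: float, corrected: bool) -> float:
    """Move the prior down after a human correction, up after a clean approval,
    and clamp it so it never saturates at 0 or 1."""
    prior += -CORRECTION_PENALTY if corrected else APPROVAL_BOOST
    return min(PRIOR_CEILING, max(PRIOR_FLOOR, prior))
```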
Bayesian probability calculation
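A minimal sketch of a two-hypothesis Bayesian update of this shape (the exact production terms may differ):

```python
EPSILON = 1e-9  # see the note below

def bayesian_update(prior: float, evidence: float) -> float:
    """Posterior probability that auto-publishing is safe, treating `evidence`
    as P(observed match | safe) and (1 - evidence) as P(observed match | unsafe)."""
    numerator = prior * evidence
    denominator = prior * evidence + (1.0 - prior) * (1.0 - evidence) + EPSILON
    return numerator / denominator
```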
The epsilon prevents division by zero while maintaining numerical stability.

Autopilot decision
Now that we have the posterior, we need to establish an autopilot threshold. This should be tuned empirically; in the following example it's 0.8. Converging on a good threshold takes a lot of tuning and analysis, but the usual rule of thumb is to start conservative and relax it as you learn how it actually performs in production.
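Putting it together as a sketch, with 0.8 as the example threshold:

```python
AUTOPILOT_THRESHOLD = 0.8  # tuned empirically; start conservative

def decide(posterior: float) -> str:
    """Auto-publish only when the posterior clears the threshold."""
    return "auto_publish" if posterior >= AUTOPILOT_THRESHOLD else "human_review"
```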

Result
Using Briefcase for our own bookkeeping, Autopilot reduced manual effort by over 80%, leaving only edge cases for review.

Magic wand icons indicate auto-published transactions

Every decision includes full explainability to build user trust
Key takeaways
Deterministic heuristics complement probabilistic AI agents exceptionally well, enabling true end-to-end automation in high-trust domains. Our approach successfully:
✅ Maintains determinism through Bayesian statistics
✅ Generalises across business types
✅ Isolates learning per business
✅ Improves through human feedback
✅ Weights recent data appropriately
We're not stopping here: we're going to ship major improvements (including completely new paradigms) to our AI agents and formal verification mechanisms in 2026. Join us to build a robust, scalable, AI-native platform that transforms an entire industry.
We are hiring!
