Building autopilot
Nov 9, 2025

Probabilistic AI agents paired with deterministic formal verification are a powerful combination for high-trust domains like finance and accounting. At Briefcase, we run 12 AI agents in production, but to truly automate critical accounting workflows we built Autopilot: our deterministic verification system that evaluates AI results in real time and learns from human corrections. After six weeks in production, here's how we built it, why determinism matters, and the impact we've seen.
Invoice processing AI agent
Briefcase automates the complete invoice capture and bookkeeping workflow:
Input processing: Unstructured data (emails, images, PDFs) containing invoices/receipts
Data extraction: Core transaction information
Line item split: Individual item separation and processing
Classification: Transaction categorisation
Tax determination: VAT rate calculation per line item

Example trace of our cost transaction processing AI agent combining LLM and business logic chains
The agent outputs fully processed transactions ready for the general ledger. Autopilot then determines whether to auto-publish or require human review based on three criteria:
Document legibility
Processing completeness
Historical consistency
Document legibility
Uploaded documents vary widely in quality, so checking legibility is essential to ensure reliable downstream processing.
Vision transformers tokenize images into patches (16×16 or 32×32 pixels), flatten them into 1D vectors, add positional embeddings for spatial awareness, stack them into 2D matrices, and feed them through attention mechanisms. To a transformer, an image is just a collection of vectors, so it can't inherently distinguish a blurry document from a legible one.
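For intuition, here is a minimal sketch (NumPy, with an illustrative 16×16 patch size) of how an image becomes a sequence of flattened patch vectors before positional embeddings and attention are applied:

```python
import numpy as np

def image_to_patch_vectors(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches and flatten each
    patch into a 1D vector, giving a (num_patches, patch*patch*C) matrix."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    # reshape into a grid of patches, then flatten each patch into one row
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

# a 224x224 RGB page becomes a 196 x 768 matrix of patch vectors
vectors = image_to_patch_vectors(np.zeros((224, 224, 3)), patch=16)
```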
To get around this, we run extraction multiple times; identical results across runs give us high confidence that the document is legible.
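A simplified sketch of the idea, with illustrative field names and run count rather than our production values:

```python
KEY_FIELDS = ["supplier", "total", "currency", "invoice_date"]  # illustrative

def legibility_confidence(document: bytes, extract, runs: int = 3) -> float:
    """Run the extraction agent `runs` times and measure agreement on key fields.
    Returns the fraction of fields on which every run produced the same value."""
    results = [extract(document) for _ in range(runs)]
    agreeing = 0
    for field in KEY_FIELDS:
        values = [r.get(field) for r in results]
        if len(set(values)) == 1:  # all runs agree on this field
            agreeing += 1
    return agreeing / len(KEY_FIELDS)

# e.g. require full agreement before treating the document as legible:
# is_legible = legibility_confidence(doc, extract_fn) == 1.0
```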
Processing completeness
This straightforward guardrail verifies:
Is all necessary data extracted?
Is the transaction from an existing supplier?
Is the document type postable to the general ledger (invoice/receipt/credit note)?
All checks must pass for this guardrail to succeed.
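A minimal sketch of this guardrail; the field names and the `known_suppliers` lookup are hypothetical:

```python
POSTABLE_TYPES = {"invoice", "receipt", "credit_note"}  # postable to the ledger
REQUIRED_FIELDS = ["supplier", "total", "currency", "invoice_date", "line_items"]

def completeness_check(txn: dict, known_suppliers: set[str]) -> bool:
    """All three checks must pass for the guardrail to succeed."""
    has_all_fields = all(txn.get(f) not in (None, "", []) for f in REQUIRED_FIELDS)
    existing_supplier = txn.get("supplier") in known_suppliers
    postable = txn.get("document_type") in POSTABLE_TYPES
    return has_all_fields and existing_supplier and postable
```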
Historical consistency
Now onto the fun stuff. Historical consistency recognises patterns in past transactions and learns from corrections to decide when auto-publishing is safe. Our design constraints:
Deterministic: Consistent results regardless of run count
Generalisable: Serves diverse businesses (B2B SaaS, retail, hospitality)
Business-isolated: Patterns from Business X never affect Business Y
Self-improving: Adjusts confidence based on human corrections
Recency-weighted: Recent transactions matter more than older ones
Why not LLM as a judge?
LLMs exhibit U-shaped confidence distributions rather than normal distributions and perform poorly on out-of-distribution (adversarial) examples. They are also costly to run at scale. LLM-as-a-judge, as it stands today, has its place in data labelling, but less so in a critical evaluation step that decides whether a human should be in the loop.
Our approach: Bayesian statistical model
We evaluated two options:
Simple statistical model
Individual ML models per business
Option 2's complexity made it unsuitable for rapid iteration and baseline establishment. We chose a Bayesian approach because it's a much simpler starting point and can be just as effective.
The algorithm:
Retrieve semantically similar published transactions
Calculate evidence
Calculate prior
Perform Bayesian update
Make autopilot decision
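Sketched end to end, with placeholder function names for the steps above (each is covered in the next section):

```python
def autopilot(txn: dict, business_id: str, threshold: float = 0.8) -> str:
    """High-level flow of the historical-consistency check (illustrative only)."""
    similar = retrieve_similar_transactions(txn, namespace=business_id)   # step 1
    evidence = calculate_evidence(txn, similar)                           # step 2
    prior = calculate_prior(business_id)                                  # step 3
    posterior = bayesian_update(prior, evidence)                          # step 4
    return "auto_publish" if posterior >= threshold else "human_review"   # step 5
```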
Implementation details
Similar transactions retrieval
We use RAG with reranking, maintaining isolated namespaces per business to ensure pattern learning remains business-specific.
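Conceptually, retrieval is always scoped to the business's own namespace and then reranked; in this sketch, `vector_store` and `rerank` are stand-ins for whichever embedding store and reranker you use:

```python
def retrieve_similar_transactions(txn, namespace, vector_store, rerank,
                                  k: int = 20, top_n: int = 5):
    """Retrieve semantically similar *published* transactions for one business only.
    The per-business namespace keeps pattern learning isolated."""
    query = f"{txn['supplier']} {txn['description']}"
    candidates = vector_store.search(query, namespace=namespace, top_k=k)
    return rerank(query, candidates)[:top_n]
```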
A more detailed overview of our RAG approach will be covered in future posts; it's out of scope for this one.
Evidence calculation
Field weights reflect relative importance:
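As an illustration (these exact fields and numbers are hypothetical, not our production weights):

```python
# illustrative field weights -- more important fields contribute more evidence
FIELD_WEIGHTS = {
    "category":       0.35,
    "tax_rate":       0.30,
    "supplier":       0.20,
    "description":    0.10,
    "payment_method": 0.05,
}
assert abs(sum(FIELD_WEIGHTS.values()) - 1.0) < 1e-9
```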
Recency decay ensures recent patterns take precedence:
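One plausible implementation, consistent with the worked example below: full weight inside a grace period, then exponential decay governed by a half-life:

```python
def recency_weight(age_months: float, grace: float = 3, half_life: float = 12) -> float:
    """Full weight within the grace period, then halve the weight for every
    `half_life` months of age."""
    if age_months <= grace:
        return 1.0
    return 0.5 ** (age_months / half_life)
```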
With example grace period = 3 months and half-life = 12 months:
3-month-old transaction: weight = 1.0
12-month-old transaction: weight = 0.5
24-month-old transaction: weight = 0.25
We normalise weights using softmax to create a proper distribution:
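A standard, numerically stable softmax over the recency weights:

```python
import numpy as np

def normalise_weights(weights: np.ndarray) -> np.ndarray:
    """Softmax: shift by the max for numerical stability, so weights sum to 1."""
    z = weights - weights.max()
    exp = np.exp(z)
    return exp / exp.sum()
```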
Field matching uses tiered scoring:
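The tiers below are illustrative rather than our exact production values, but they show the shape of the scoring and how it combines with the field weights (reusing the illustrative `FIELD_WEIGHTS` above) and the normalised recency weights:

```python
def field_match_score(candidate: str | None, historical: str | None) -> float:
    """Tiered scoring for a single field (tier values are illustrative)."""
    if candidate is None or historical is None:
        return 0.0
    if candidate == historical:
        return 1.0                                       # exact match
    if candidate.lower().strip() == historical.lower().strip():
        return 0.8                                       # match up to casing/whitespace
    return 0.0                                           # mismatch

def evidence_score(txn: dict, similar: list[dict], weights_norm) -> float:
    """Weighted evidence: per-field agreement scaled by field importance,
    weighted by each similar transaction's normalised recency weight."""
    return sum(
        w_recency * sum(
            w_field * field_match_score(txn.get(f), hist.get(f))
            for f, w_field in FIELD_WEIGHTS.items()
        )
        for hist, w_recency in zip(similar, weights_norm)
    )
```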
Our internal debugging tool visualises individual decisions:

Prior calculation
Constants control learning behaviour:
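As an illustration (these names and values are hypothetical, not our production constants):

```python
# illustrative constants governing how the prior moves with feedback
BASE_PRIOR = 0.5          # starting belief before any history exists
CORRECTION_PENALTY = 0.1  # how much a human correction lowers the prior
APPROVAL_BOOST = 0.02     # how much an untouched auto-publish raises it
PRIOR_FLOOR, PRIOR_CEILING = 0.05, 0.95  # keep the prior away from 0 and 1
```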
Human corrections directly influence future decisions; based on these adjustments, we calculate the new prior:
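A sketch of the update, using the hypothetical constants above:

```python
def updated_prior(prior: float, corrected: bool) -> float:
    """Move the prior down after a human correction, up after a clean approval,
    and clamp it so it never saturates at 0 or 1."""
    prior += -CORRECTION_PENALTY if corrected else APPROVAL_BOOST
    return min(PRIOR_CEILING, max(PRIOR_FLOOR, prior))
```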
Bayesian probability calculation
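A minimal sketch of a two-hypothesis Bayesian update of this shape (the exact production terms may differ):

```python
EPSILON = 1e-9  # see the note below

def bayesian_update(prior: float, evidence: float) -> float:
    """Posterior probability that auto-publishing is safe, treating `evidence`
    as P(observed match | safe) and (1 - evidence) as P(observed match | unsafe)."""
    numerator = prior * evidence
    denominator = prior * evidence + (1.0 - prior) * (1.0 - evidence) + EPSILON
    return numerator / denominator
```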
The epsilon prevents division by zero while maintaining numerical stability.

Autopilot decision
Now that we have the posterior, we need to establish an autopilot threshold. This should be tuned empirically; in the following example it's 0.8. Converging on a good threshold takes a lot of tuning and analysis, but the usual rule of thumb is to start conservative and relax it as you learn how it actually performs in production.
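Putting it together as a sketch, with 0.8 as the example threshold:

```python
AUTOPILOT_THRESHOLD = 0.8  # tuned empirically; start conservative

def decide(posterior: float) -> str:
    """Auto-publish only when the posterior clears the threshold."""
    return "auto_publish" if posterior >= AUTOPILOT_THRESHOLD else "human_review"
```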

Result
Using Briefcase for our own bookkeeping, Autopilot reduced manual effort by over 80%, leaving only edge cases for review.

Magic wand icons indicate auto-published transactions

Every decision includes full explainability to build user trust
Key takeaways
Deterministic heuristics complement probabilistic AI agents exceptionally well, enabling true end-to-end automation in high-trust domains. Our approach successfully:
✅ Maintains determinism through Bayesian statistics
✅ Generalises across business types
✅ Isolates learning per business
✅ Improves through human feedback
✅ Weights recent data appropriately
We're not stopping here: we're going to ship major improvements (including completely new paradigms) to our AI agents and formal verification mechanisms in 2026. Join us to build a robust, scalable, AI-native platform that transforms an entire industry.
We are hiring!
