Building Tax-Aware AI

Dec 15, 2025

When an AI agent classifies a transaction's VAT treatment, it shouldn't be guessing; it should cite the exact HMRC guidance it relied on. We built a retrieval system that gives our agents access to HMRC’s published VAT guidance (VAT Notices), enabling them to make grounded decisions backed by authoritative sources.

The Problem: AI Agents Need Legal Context

Our invoice processing agents classify thousands of transactions for VAT treatment. Each classification (e.g., standard-rated, zero-rated, exempt, or outside scope) has real financial implications. Getting it wrong means compliance issues.

The challenge: how do you give an AI agent access to legislation in a way that's accurate, up-to-date, and citable?

Fine-tuning wasn’t the answer. Guidance changes. We needed traceability: when an agent says a transaction is zero-rated, we want it to cite the specific HMRC notice and section. We also needed freshness: when HMRC updates a notice, our agents should reflect that change without retraining.

The solution: Retrieval-Augmented Generation (RAG). We built a semantic search system over HMRC VAT Notices that retrieves relevant VAT guidance at inference time, providing authoritative context for every decision.

This post walks through how we went from messy GOV.UK HTML to a production-ready retrieval layer that our agents call on every VAT decision.

Architecture Overview

Why HMRC notices? UK VAT decisions sit across two layers: primary legislation (for example, the Value Added Tax Act 1994 and associated regulations) and HMRC’s interpretive guidance (VAT Notices and guidance pages). In practice, VAT Notices contain the operational detail our agents need day-to-day. These notices explain how to apply the rules to concrete scenarios like “is pet food zero-rated?” or “when does the reverse charge apply?”. Primary legislation defines the framework; the notices answer the questions we most often face at classification time. We started with notices because they offered the highest signal-to-noise ratio for practical accuracy, then designed the system so we can extend to primary legislation later for edge cases and audit trails.

Our pipeline transforms raw HTML from GOV.UK into queryable vector embeddings:

[Diagram: data pipeline from GOV.UK HTML to embedded, queryable chunks]

Technology choices:

  • Voyage AI (voyage-finance-2): domain-specific embeddings trained on financial and legal text, outperforming other models on our retrieval benchmarks

  • Turbopuffer: managed vector database with low-latency queries and hybrid search capabilities

The result: 5,635 chunks from 100+ HMRC VAT Notices, each semantically indexed and retrievable in milliseconds.

Building the Pipeline

Stage 1: Scraping and Tokenisation

We start with the GOV.UK VAT notices collection, the authoritative source for UK VAT guidance. Each notice is a structured HTML document with headings, paragraphs, lists, tables, and callouts.

Our scraper converts this HTML into a flat stream of typed tokens:

export enum BlockType {
  ADDRESS = 'address',
  INFO = 'info',
  EXAMPLE = 'example',
  PARAGRAPH = 'paragraph',
  LIST = 'list',
  TABLE = 'table',
  HEADING = 'heading',
  CALLOUT = 'callout',
}

This tokenisation preserves semantic meaning while normalising the varied HTML structures across different notices. A warning callout is tagged differently from a regular paragraph, which matters when we later decide what context to include.
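
To make the tokenisation step concrete, here is a minimal sketch. It assumes a DOM parser such as jsdom, reuses the BlockType enum above (imported from a hypothetical module path), and the selector and the 'call-to-action' class for callouts are illustrative rather than our exact scraper rules.

// Minimal tokenisation sketch: walk the notice body and emit typed tokens.
// Selector and class names are illustrative, not our exact production rules.
import { JSDOM } from 'jsdom';
import { BlockType } from './blocks'; // hypothetical path to the enum above

type Token = { type: BlockType; text: string; headingLevel?: number };

function tokenise(html: string): Token[] {
  const document = new JSDOM(html).window.document;
  const tokens: Token[] = [];

  document.body
    .querySelectorAll('h2, h3, h4, p, ul, ol, table, .call-to-action')
    .forEach((el) => {
      const tag = el.tagName.toLowerCase();
      const text = el.textContent?.trim() ?? '';

      if (tag.startsWith('h')) {
        tokens.push({ type: BlockType.HEADING, text, headingLevel: Number(tag[1]) });
      } else if (tag === 'ul' || tag === 'ol') {
        tokens.push({ type: BlockType.LIST, text });
      } else if (tag === 'table') {
        tokens.push({ type: BlockType.TABLE, text });
      } else if (el.classList.contains('call-to-action')) {
        // Callout-style boxes are tagged separately from ordinary paragraphs
        tokens.push({ type: BlockType.CALLOUT, text });
      } else {
        tokens.push({ type: BlockType.PARAGRAPH, text });
      }
    });

  return tokens;
}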

Stage 2: Hierarchical Parsing

Legal and quasi-legal documents have inherent structure. A VAT notice isn’t just a wall of text: it’s organised into sections (1, 2, 3…), subsections (1.1, 1.2…), and nested content with markers like (a), (b), (c) or roman numerals, which then contain examples, tables, and lists.

The challenge is that HMRC notices don’t follow a consistent schema. The HTML structure varies between notices; some use <h2> for major headings, others rely on different heading levels. Section numbers also often appear in the visible text (for example, “1.1 What this notice covers”) rather than semantic markup. We can’t just traverse a neat XML tree.

Our solution is to reconstruct the hierarchy dynamically by combining two signals:

  1. Heading tags (<h2>, <h3>, etc.) as structural breaks

  2. Numbering patterns in text (1., 1.1, (a), i.) as nesting depth

The parser maintains a scope stack to track our current position in the tree:

type Section = {
  number: string           // e.g., "4.2"
  title: string            // e.g., "Zero-rated supplies"
  content: ContentBlock[]
  references: Reference[]
  subsections: Section[]   // recursively nested
}

How it works (high level):

  • Headings start new structural scopes (with flexible handling for inconsistent heading levels)

  • Numbered markers like “1.1” define nesting depth beneath the current scope

  • Lettered markers (a), (b) nest under the current numbered section

  • Roman numerals i., ii. nest under lettered items

  • A paragraph ending with a colon captures subsequent blocks as children

This dual-signal approach reconstructs a consistent tree even when the underlying HTML is messy.
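
As a rough illustration of the scope stack, the sketch below handles dotted numeric markers only; lettered markers, roman numerals, colon capture, and inconsistent heading levels need the extra rules described above. The type and function names are illustrative.

// Simplified scope-stack sketch: dotted numeric markers define nesting depth.
type TreeSection = {
  number: string;            // e.g. "4.2"
  title: string;
  subsections: TreeSection[];
};

const SECTION_MARKER = /^(\d+(?:\.\d+)*)\s+(.*)$/; // e.g. "4.2 Zero-rated supplies"

function buildTree(headings: string[]): TreeSection[] {
  const roots: TreeSection[] = [];
  const stack: TreeSection[] = []; // currently open scopes, shallowest first

  for (const heading of headings) {
    const match = heading.match(SECTION_MARKER);
    if (!match) continue; // unnumbered headings need separate handling
    const [, number, title] = match;
    const depth = number.split('.').length; // "4.2" -> depth 2

    // Close scopes at the same depth or deeper before opening the new one.
    while (stack.length >= depth) stack.pop();

    const section: TreeSection = { number, title, subsections: [] };
    if (stack.length === 0) roots.push(section);
    else stack[stack.length - 1].subsections.push(section);
    stack.push(section);
  }

  return roots;
}

// buildTree(['1 Overview', '1.1 What this notice covers', '2 Scope'])
// -> section 1 with 1.1 nested beneath it, followed by section 2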

Example: Animals and animal food (VAT Notice 701/15)

[Image: section 3, “Birds and fish”, of VAT Notice 701/15 as published on GOV.UK]

…is parsed into a hierarchy where a table can attach to the paragraph that introduces it:

{
  "number": "3",
  "title": "Birds and fish",
  "subsections": [
    {
      "number": "3.1",
      "title": "Birds",
      "content": [
        {
          "block_type": "paragraph",
          "text": "Most breeds of chicken are zero-rated, as are game birds and ostriches. Ornamental breeds of birds are standard-rated."
        },
        {
          "block_type": "paragraph",
          "text": "The following breeds of ducks, geese and turkeys are zero-rated:",
          "subparagraphs": [
            {
              "nested_blocks": [
                {
                  "block_type": "table",
                  "headers": ["Type of fowl", "Breed"],
                  "rows": [
                    ["Ducks", "Aylesbury, Campbell (Khaki Campbell), Indian Runner, Muscovy, Pekin and derivatives and crossbreeds of these"],
                    ["Geese", "Brecon Buff, Chinese Commercial, Embdem, Roman, Toulouse and derivatives and crossbreeds of these"],
                    ["Turkeys", "Beltsville White, British White, Broadbreasted White, Bronze (Broadbreasted Bronze), Norfolk Black and derivatives and crossbreeds of these"]
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

This hierarchy is critical for the next stage.

Stage 3: Semantic Chunking

Many RAG systems chunk by fixed character counts (for example, split every 500 characters with overlap). We took a different approach: chunk at semantic boundaries.

VAT Notices already define the units humans reason with: sections and subsections. We chunk at the leaf section level, and prepend just enough parent context (notice title, section path) so each chunk is interpretable in isolation. This avoids splitting concepts mid-paragraph and reduces accidental mixing of unrelated rules.

Each stored chunk includes:

  • the section text

  • its hierarchical path (for attribution and UI display)

  • stable identifiers (notice number, section number)

  • the source URL for citations

We also remove boilerplate (feedback links, navigation, “your rights and obligations”) to reduce retrieval noise.
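
A simplified sketch of leaf-level chunking, assuming a parsed section tree like the one from Stage 2; the field names and chunk id format are illustrative, and the real pipeline also handles content that lives on non-leaf sections.

// Trimmed-down view of the parsed tree: real sections also carry references
// and typed content blocks rather than a single flattened text field.
type ParsedSection = {
  number: string;
  title: string;
  text: string;              // flattened content for this section
  subsections: ParsedSection[];
};

type Chunk = {
  id: string;                // e.g. "701/15-3.1" (format is illustrative)
  hierarchicalPath: string;
  text: string;
  sourceUrl: string;
};

function chunkNotice(
  noticeNumber: string,
  noticeTitle: string,
  sourceUrl: string,
  sections: ParsedSection[],
): Chunk[] {
  const chunks: Chunk[] = [];

  const walk = (section: ParsedSection, path: string[]): void => {
    const currentPath = [...path, `${section.number} ${section.title}`];
    if (section.subsections.length === 0) {
      // Leaf section: prepend just enough parent context to stand alone.
      chunks.push({
        id: `${noticeNumber}-${section.number}`,
        hierarchicalPath: [noticeTitle, ...currentPath].join(' > '),
        text: `${noticeTitle} > ${currentPath.join(' > ')}\n\n${section.text}`,
        sourceUrl,
      });
    } else {
      section.subsections.forEach((sub) => walk(sub, currentPath));
    }
  };

  sections.forEach((section) => walk(section, []));
  return chunks;
}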

Embeddings and Storage

Vector Storage Schema

Turbopuffer stores each chunk with filterable metadata:

export const schema = {
  // filterable
  notice_number:  { filterable: true },
  section_number: { filterable: true },
  source_type:    { filterable: true },

  // full-text searchable
  notice_title:      { full_text_search: true },
  section_title:     { full_text_search: true },
  hierarchical_path: { full_text_search: true },
  text:              { full_text_search: true },

  // stored (for attribution), not indexed
  source_ref: {},
  source_url: {},
}

This enables hybrid queries: vector similarity search combined with filters like “only search within Notice 700” or keyword matching for specific terms.
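
To illustrate, the sketch below builds the conceptual payload for such a hybrid query. buildHybridQuery is a hypothetical helper; the actual Turbopuffer client has its own API, so treat this as the shape of the request rather than SDK code.

// Conceptual hybrid-query payload: vector similarity plus optional filters
// and keyword matching over the full-text-indexed fields in the schema above.
type HybridQuery = {
  vector: number[];                       // embedding of the query phrase
  topK: number;
  filters?: { notice_number?: string };   // e.g. restrict to Notice 700
  keywords?: string;                      // optional full-text match
};

function buildHybridQuery(
  embedding: number[],
  options: { noticeNumber?: string; keywords?: string } = {},
): HybridQuery {
  return {
    vector: embedding,
    topK: 10,
    ...(options.noticeNumber ? { filters: { notice_number: options.noticeNumber } } : {}),
    ...(options.keywords ? { keywords: options.keywords } : {}),
  };
}

// Example: only search within Notice 700.
// buildHybridQuery(embedding, { noticeNumber: '700' })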

Visualising the Embedding Space

The t-SNE visualisation below shows our 5,635 chunks projected into 2D, coloured by notice number:

[Figure: t-SNE projection of the 5,635 chunk embeddings, coloured by notice number]

Semantically similar topics cluster together: food-related notices group in one region, and financial-services guidance clusters elsewhere. It’s a useful sanity check that our chunking and embedding strategy preserves topical structure.

Agent Integration

The retrieval function is the interface between our agents and the indexed VAT guidance:

[Diagram: the retrieval interface between agents and the indexed VAT guidance]

Query construction: For each invoice line item, we generate a short, standardised phrase that captures what was purchased (for example, “diesel fuel”, “coffee capsules”, “software subscription”). This is derived from the raw line-item text, but we strip noisy attributes like merchant names, locations, order IDs, and other incidental details that could otherwise bias retrieval.

This one step materially improves retrieval quality: the query carries the semantic signal we care about, without dragging in irrelevant tokens that hurt precision.
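
As a toy illustration of the idea (our production step is richer; stripping merchant names and locations reliably needs more than regexes), a hypothetical helper might look like this:

// Toy canonicalisation sketch: strip obvious noise from a raw line-item description.
function canonicaliseLineItem(rawDescription: string): string {
  return rawDescription
    .toLowerCase()
    // drop obvious order/invoice reference codes like "REF-20931"
    .replace(/\b(?:inv|ord|ref)[-_]?\d+\b/gi, '')
    .replace(/#\d+/g, '')
    // drop dates such as 12/03/2025 or 2025-03-12
    .replace(/\b\d{1,4}[\/\-]\d{1,2}[\/\-]\d{1,4}\b/g, '')
    // collapse leftover punctuation and whitespace
    .replace(/[^a-z0-9\s]/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// canonicaliseLineItem('SHELL UK #48211 Diesel Fuel 12/03/2025 REF-20931')
// -> 'shell uk diesel fuel'  (the merchant name survives; removing it needs a richer step)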

Retrieval pipeline:

  1. Embed the canonical description with voyage-finance-2

  2. Run an ANN search in Turbopuffer for top-K candidates (we over-fetch internally)

  3. Truncate to fit the context budget (for example, ~14,000 characters / ~4,000 tokens)

  4. Return the final top-K chunks with source URLs so the agent can cite them
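
Putting those steps together, here is a sketch of the retrieval function; embedQuery and searchChunks are hypothetical stand-ins for the Voyage AI and Turbopuffer client calls, and the budget handling and K value are simplified.

type RetrievedChunk = {
  text: string;
  hierarchical_path: string;
  source_url: string;
  score: number;
};

async function embedQuery(text: string): Promise<number[]> {
  throw new Error('wire up to the voyage-finance-2 embeddings client');
}

async function searchChunks(vector: number[], topK: number): Promise<RetrievedChunk[]> {
  throw new Error('wire up to a Turbopuffer ANN query');
}

const CONTEXT_BUDGET_CHARS = 14_000; // roughly ~4,000 tokens

async function retrieveVatGuidance(canonicalDescription: string, topK = 5): Promise<RetrievedChunk[]> {
  const vector = await embedQuery(canonicalDescription);

  // Over-fetch so the context budget can still be filled after truncation.
  const candidates = await searchChunks(vector, topK * 3);

  const selected: RetrievedChunk[] = [];
  let used = 0;
  for (const chunk of candidates) {
    if (selected.length === topK) break;
    if (used + chunk.text.length > CONTEXT_BUDGET_CHARS) continue; // would blow the budget
    selected.push(chunk);
    used += chunk.text.length;
  }
  return selected;
}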

Example: for the query “diesel fuel”, the system retrieves:

  • Fuel and power (VAT Notice 701/19) > 6 Oils > 6.2 Standard-rated supplies > 6.2.2 Road fuel

  • Fuel and power (VAT Notice 701/19) > 4 Gases > 4.4 Road fuel

The agent can then make a VAT treatment decision backed by the retrieved guidance, with citations users can verify.

We surface the retrieved sources in the UI so users can inspect the justification:

[Screenshot: retrieved VAT guidance sources displayed in the UI for an example transaction]

Evaluation

We evaluate the system at two levels: retrieval quality and end-to-end VAT classification accuracy.

Retrieval evaluation:

We measure whether the system retrieves the correct VAT guidance using three metrics:

  • Content similarity score: We embed both the retrieved and expected passages, then compute cosine similarity. This measures semantic relevance: did we retrieve content that is actually about the right topic?

  • Source URL precision: We compare retrieved URLs against ground truth. If the correct guidance is Notice 701/1, section 3.2, did we retrieve that exact source?

  • Retrieval count score: Did we return the expected number of results? Too few indicates missing context; too many introduces noise and increases hallucination risk.
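
To make the first metric concrete, here is a minimal cosine-similarity sketch over two embedding vectors:

// Cosine similarity between the embedding of a retrieved passage and the
// embedding of the expected passage.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('embedding dimensions must match');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores close to 1 mean the retrieved text covers the same topic as the expected guidance.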

End-to-end VAT classification evaluation:

Retrieval is only valuable if it improves agent decisions, so we also evaluate the full pipeline:

  • VAT rate accuracy: Does the agent select the correct VAT treatment?

  • Reasoning quality: Is the justification coherent and grounded in the retrieved guidance?

  • Legislation citation accuracy: Does the agent cite the correct HMRC notice and section for its decision?

We maintain separate evaluation datasets for different invoice types, each with manually verified expected outcomes.

Lessons Learnt

Semantic chunking beats fixed-size chunking for legal-style documents. Character-based splitting breaks mid-sentence and mixes unrelated concepts. Sections and subsections are the boundaries humans use, so they’re the boundaries our retrieval system should use too. The trade-off is variable chunk size (our P95 is ~2,100 tokens), which forces careful context-budget management.

Formatting overhead is real. We measured ~24% token overhead from markdown formatting. Dense paragraphs use fewer tokens than the same content formatted into lists and tables. We accepted the trade-off for readability; the hierarchy makes the context easier for the model (and users) to parse.

HMRC guidance is critical for day-to-day classification. In practice, VAT Notices carried most of the decision-making signal for our use-cases; retrieving primary legislation was most useful for edge cases and audit trails rather than improving routine classification accuracy.

Good datasets don’t happen by accident. A robust parsing structure made everything downstream easier: better chunking, better evaluation, and faster iteration.

Cross-references are valuable metadata. VAT Notices reference each other extensively (“see VAT Notice 700/1, section 4.2”). We extract and store these references, which sets us up for future graph-based retrieval and navigation.

We are hiring!