Data Extraction
Get the data out of the PDFs, statements and web pages.
Structured information is locked inside unstructured documents: contracts, bank statements, expense receipts, listings, supplier invoices. I build pipelines that read them, validate the output, and write clean data into the systems you already use.
01, The problem
Your data is fine. It's just not where you can use it.
Every business has data trapped inside documents nobody designed to be queried. A contract has the parties, the term, the fees and the renewal date, but they're prose, not fields. A bank statement has 200 transactions, but as a PDF, not a spreadsheet. A supplier portal has the pricing your team needs to monitor, but only behind a login, only on demand, only when someone remembers.
The cost is usually invisible because the work is spread across the team in 10-minute chunks. Someone retypes invoice line items. Someone screenshots a competitor's listing. Someone copies key terms out of a contract into a tracking sheet. Each instance is small. Across a year, it's hundreds of hours of senior-ish time on data entry that should never have happened.
What you actually need is a pipeline that reads the source, extracts the structured output, validates it, and writes it where you can use it. Cheaply, repeatedly, reliably.
02, What gets extracted
Four recurring shapes of extraction work.
Contract data extraction: parties, dates, terms, fees, renewal, liability caps, governing law, unusual clauses. From a PDF, into a structured record in your matter or CRM system. The clause text travels with it for audit.
Statement and receipt parsing: bank statements, supplier invoices, expense receipts. Transactions are categorised and mapped to your chart of accounts, with the source line referenced. Confidence scores let humans handle the exceptions, not the bulk.
Web monitoring: competitor pricing, content drops, supplier portals, listing changes. Scheduled scrapers pull on a cadence, dedupe against the last run, and surface only the diffs.
Form and email parsing: turning the messy text of an enquiry email or a free-form application into the structured shape your CRM expects. Names, requirements, budgets and timelines are extracted, validated, and routed.
Each shape uses the same pipeline architecture; the difference is the source format, the schema, and how the output is delivered.
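To make the "dedupe against the last run, surface only the diffs" step of web monitoring concrete, here is a minimal sketch. The Listing fields, the diff_runs helper and the sample data are all illustrative assumptions, not the schema of any real pipeline; the only requirement is that each scraped item carries a stable key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """One scraped item. The fields are illustrative, not a fixed schema."""
    key: str    # stable identifier, e.g. a product URL or SKU (assumption)
    title: str
    price: str


def diff_runs(previous: list[Listing], current: list[Listing]) -> dict[str, list]:
    """Compare two scrape runs and return only what changed between them."""
    # Keying by identifier also dedupes repeats within a single run.
    prev_by_key = {item.key: item for item in previous}
    curr_by_key = {item.key: item for item in current}

    added = [curr_by_key[k] for k in curr_by_key if k not in prev_by_key]
    removed = [prev_by_key[k] for k in prev_by_key if k not in curr_by_key]
    changed = [
        (prev_by_key[k], curr_by_key[k])
        for k in curr_by_key
        if k in prev_by_key and curr_by_key[k] != prev_by_key[k]
    ]
    return {"added": added, "removed": removed, "changed": changed}


# Made-up sample data for two consecutive runs.
last_week = [
    Listing("sku-1", "Starter plan", "49.00"),
    Listing("sku-2", "Pro plan", "99.00"),
]
this_week = [
    Listing("sku-2", "Pro plan", "109.00"),
    Listing("sku-3", "Team plan", "199.00"),
]
report = diff_runs(last_week, this_week)
```

Unchanged items never appear in the report, which is what keeps a weekly brief short: the reader sees one price change, one new listing and one withdrawn listing, and nothing else.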
03, How it works
Structured output, confidence scores, exceptions to humans.
Every extraction pipeline I build follows the same skeleton:
Schema first: I write down exactly what the output looks like before any code runs. Field names, types, validation rules, what counts as ambiguous. The schema is the contract between the pipeline and the downstream systems.
Read: depending on the source, native PDF parsing for clean documents, Claude's vision capability for scanned or messy PDFs, Playwright for web pages that need to be rendered, plain API calls for everything else.
Extract: Claude with structured output mode. The model is constrained to return JSON matching the schema. Where it's genuinely uncertain, it returns null rather than guessing, and the field is flagged for review.
Validate: domain rules run on the output. Dates have to be real dates. Totals have to add up. References have to exist. Failures get tagged, not silently dropped.
Route: clean output writes to your system of record. Flagged output lands in a review queue with the source document side-by-side. A human resolves; the system learns from the correction.
What you don't get: an AI that's confidently wrong without you knowing. The pipeline tells you exactly what it was unsure about and why.
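The validate and route steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: the invoice field names, the Result class and the two destination labels are hypothetical, and the input dicts stand in for JSON the extraction step has already returned. No model call is shown.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal, InvalidOperation

# Hypothetical schema for a supplier invoice (field names are assumptions).
REQUIRED_FIELDS = ("supplier", "invoice_date", "line_totals", "total")


@dataclass
class Result:
    record: dict
    flags: list[str] = field(default_factory=list)

    @property
    def destination(self) -> str:
        # Anything flagged goes to a human; clean records go straight through.
        return "review_queue" if self.flags else "system_of_record"


def validate(record: dict) -> Result:
    """Run domain rules and tag failures instead of dropping them."""
    result = Result(record)

    # A null from the extraction step means "uncertain", so it is flagged.
    for name in REQUIRED_FIELDS:
        if record.get(name) is None:
            result.flags.append(f"{name}: missing or returned as null")

    # Dates have to be real dates.
    raw_date = record.get("invoice_date")
    if raw_date is not None:
        try:
            date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            result.flags.append(f"invoice_date: not a real date ({raw_date!r})")

    # Totals have to add up. Decimal avoids float rounding on money.
    lines, total = record.get("line_totals"), record.get("total")
    if lines is not None and total is not None:
        try:
            if sum(Decimal(x) for x in lines) != Decimal(total):
                result.flags.append("total: does not equal the sum of line_totals")
        except (InvalidOperation, TypeError, ValueError):
            result.flags.append("total: non-numeric amount")

    return result


# Made-up records standing in for extraction output.
clean = {"supplier": "Acme Ltd", "invoice_date": "2024-03-14",
         "line_totals": ["40.00", "60.00"], "total": "100.00"}
messy = {"supplier": None, "invoice_date": "2024-02-30",
         "line_totals": ["40.00", "60.00"], "total": "110.00"}

for rec in (clean, messy):
    outcome = validate(rec)
    print(outcome.destination, outcome.flags)
```

Each flag names the field and the reason, so the reviewer sees why a record was held back rather than just that it was.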
04, Technical stack
Stack built for accuracy, audit, and cost.
Claude API + structured output
Claude Sonnet for the harder extraction work (contracts, nuanced documents). Haiku for high-volume, well-shaped sources (transaction parsing, web monitoring). Structured-output mode constrains the model to your schema.
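One simple way to express that split is a routing table keyed by source type. The sketch below is an assumption about how such routing might look: the source-type names are invented, and "sonnet" and "haiku" are tier labels only, not real API model identifiers.

```python
# Tier labels are placeholders, not real API model identifiers.
MODEL_TIER_BY_SOURCE = {
    "contract": "sonnet",          # harder, nuanced extraction
    "nuanced_document": "sonnet",
    "transaction": "haiku",        # high-volume, well-shaped input
    "web_monitoring": "haiku",
}


def pick_model_tier(source_type: str) -> str:
    """Return the tier for a source type.

    Unknown sources fall back to the stronger tier, on the assumption
    that accuracy matters more than cost until the source is understood.
    """
    return MODEL_TIER_BY_SOURCE.get(source_type, "sonnet")
```

For example, pick_model_tier("transaction") selects the cheaper tier, while a source type that has not been classified yet gets the stronger one.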
Playwright for the web
For sources that sit behind logins, rely on dynamic rendering, or are hostile to scraping: Playwright running headless Chromium, on a schedule, with rate-limited polite crawling. Built to last, not to break on the next platform update.
Postgres + your CRM
Structured output lands in a Postgres staging layer with full lineage (which document, which run, which prompt version). From there into the CRM, practice tool, or accounting system you actually use day-to-day.
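As a rough illustration of what "full lineage" means at the row level, the sketch below stores the document, run and prompt version alongside every extracted payload. It uses Python's built-in sqlite3 as a stand-in so the example runs anywhere; the production layer described above is Postgres, and the table name, column names and sample values are all assumptions.

```python
import json
import sqlite3

# In-memory SQLite stands in for the Postgres staging layer (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE staging_extractions (
        id             INTEGER PRIMARY KEY,
        document_id    TEXT NOT NULL,     -- which document
        run_id         TEXT NOT NULL,     -- which run
        prompt_version TEXT NOT NULL,     -- which prompt version
        payload        TEXT NOT NULL,     -- extracted record as JSON
        needs_review   INTEGER NOT NULL DEFAULT 0
    )
    """
)


def stage(document_id: str, run_id: str, prompt_version: str,
          payload: dict, needs_review: bool = False) -> None:
    """Write one extracted record together with its lineage."""
    conn.execute(
        "INSERT INTO staging_extractions "
        "(document_id, run_id, prompt_version, payload, needs_review) "
        "VALUES (?, ?, ?, ?, ?)",
        (document_id, run_id, prompt_version, json.dumps(payload), int(needs_review)),
    )
    conn.commit()


# Made-up sample rows.
stage("nda-0042.pdf", "run-2024-03-14", "contract-v3",
      {"governing_law": "England and Wales"})
stage("nda-0043.pdf", "run-2024-03-14", "contract-v3",
      {"governing_law": None}, needs_review=True)

# Lineage makes questions like this cheap to answer:
# which documents did prompt version contract-v3 leave for review?
rows = conn.execute(
    "SELECT document_id FROM staging_extractions "
    "WHERE prompt_version = ? AND needs_review = 1",
    ("contract-v3",),
).fetchall()
```

Keeping the prompt version on each row means that when a prompt changes, the affected documents can be found and re-run without guessing which output came from which version.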
05, Result
What this looks like in practice.
A contract data-extraction pipeline for a law firm: 200 historic NDAs run through to produce a structured database of parties, terms and unusual clauses. From "we have a folder of PDFs" to "we can search and filter our contract estate" in three weeks. Stack: Claude Sonnet, Node.js, Postgres, integration with iManage.
A receipt-parsing pipeline for an accountancy practice: 1,200 transactions a month across 80 clients, parsed from PDFs and image uploads, categorised against each client's chart of accounts, dropped into Xero. From 16 hours of staff time a week to 4 hours of exception handling. Build time: 5 weeks.
A competitor-monitoring pipeline for an agency: weekly scrape of 14 competitor sites, diffed against the last run, summarised into a Friday brief. The strategist gets the changes that matter, ignores the noise. Build time: 2 weeks.
06, Is this right for you?
Where extraction fits and where it doesn't.
A good fit if:
You have a recurring extraction task (at least 50 documents a month), or a few high-value documents where accuracy matters.
The output shape is roughly definable: even if the input is messy, you know what fields you need at the end.
You have somewhere structured for the output to land (a database, a spreadsheet, a system of record).
You're comfortable with a human review queue for the cases the AI flags as uncertain.
Probably not the right move if:
Your documents are wildly heterogeneous: every one has a different shape and you can't define a target schema.
You're extracting from sources behind aggressive anti-bot protection (some sites genuinely cannot be scraped reliably).
The volume is so low that doing it by hand is cheaper than the build.
Accuracy needs to be 100% with no human review at all. That's a different category of problem and rarely solvable with current AI.
Related
Where this fits in.
Part of the wider AI implementation work I do. The other outcome lanes:
Workflow automation: for the broader pipelines that data extraction usually feeds.
Custom AI agents: when the extraction needs judgement, not just parsing.
Working in a specific sector? For law firms · For accountancy
Got data trapped somewhere?
Thirty-minute scoping call. Bring an example of the source documents and roughly what you want to do with the output. I'll tell you whether an extraction pipeline is the right answer, what it would cost, and how long it would take.
Book a scoping call