    Vision LLMs Are Eating Classical OCR

    Vision language models now read scanned title packages with accuracy that rivals dedicated OCR. Here's what that shift means for Quebec title review and the hybrid pipeline inside Cleardeal.

    By James R. Gosnell
    Educational content. Not legal advice.

    The Two-Stage Pipeline Just Collapsed

    For two decades, document automation in real estate followed the same recipe: run OCR on the scan, then point regular expressions and trained extractors at the text. Most accuracy was lost in the handoff.

    In the first half of 2026, that recipe is on the way out. Frontier vision models read a page end to end. Claude Sonnet 4.5, GPT-5, and Gemini 2.5 Pro accept a scanned PDF and return structured JSON directly, with no separate OCR step. Reported extraction accuracy on text-heavy invoices is roughly 98% for GPT-5, around 97% for Claude Sonnet 4.5, and 96% for Gemini 2.5 Pro. Specialized open models like GLM-OCR still lead OmniDocBench v1.5 with a 94.6 score, but on enterprise legal documents the frontier models are inside the margin.

    The bigger shift is engineering cost. Each new register format used to mean a new extractor or regex bundle. With a vision model, a new document type means a new prompt.

    Why Title Packages Broke Classical OCR

    A Quebec title package from the Land Register runs forty to a hundred pages: inscriptions from 1880 to 2026, handwritten cadastral references, notary stamps on top of typed text, multi-column registry pages where the layout is the meaning, and discharge notices stapled behind the original hypothec. Scanned on different days at different resolutions by different clerks.

    Classical OCR turns all of that into a single column of characters and asks the downstream parser to figure out what belongs to what. The parser cannot tell that a date next to "radiation" means the hypothec was discharged, while the same date next to "inscription" means it was created. It has no idea that the rotated stamp on page eight is the registrar's seal validating the entry on page seven.

    Vision models read the page the way a human articling student does. They link a stamp to the date next to it. They follow a table across columns without explicit rules. They notice when an entry has been struck through with diagonal pen lines and treat it accordingly. The model returns structured records directly.

    Where Classical OCR Still Earns Its Spot

    The vision-first story is real, but the eulogy for classical OCR is premature. Three things keep it in production.

    Cost. At volume, classical OCR is close to free, and the cheapest vision models are not far behind: Gemini 2.0 Flash will process ten thousand pages for under two dollars. Frontier models are not in that range: GPT-4 Vision and Claude Opus at scale run fifty to a hundred dollars per ten thousand pages. At a million pages a year, that is under two hundred dollars against five to ten thousand, so running a frontier model only when needed is a real budget line.
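
    The arithmetic behind that budget line can be sketched as follows. The per-10,000-page prices are lifted from the figures quoted above and are illustrative, not vendor quotes; the 15% frontier share and the function name `annualCostUsd` are hypothetical.

```typescript
// Illustrative prices in US dollars per 10,000 pages, taken from the
// figures in the text. These are assumptions, not current vendor pricing.
const CHEAP_USD_PER_10K = 2;
const FRONTIER_LOW_USD_PER_10K = 50;
const FRONTIER_HIGH_USD_PER_10K = 100;

// Annual cost when a share of pages (0 to 1) goes to a frontier model
// and the remainder goes to the cheap tier.
function annualCostUsd(
  pagesPerYear: number,
  frontierShare: number,
  frontierUsdPer10k: number,
): number {
  if (frontierShare < 0 || frontierShare > 1) {
    throw new RangeError("frontierShare must be between 0 and 1");
  }
  const frontierPages = Math.round(pagesPerYear * frontierShare);
  const cheapPages = pagesPerYear - frontierPages;
  return (
    (cheapPages / 10000) * CHEAP_USD_PER_10K +
    (frontierPages / 10000) * frontierUsdPer10k
  );
}

const PAGES = 1000000;
const cheapOnly = annualCostUsd(PAGES, 0, FRONTIER_LOW_USD_PER_10K);
const frontierOnlyLow = annualCostUsd(PAGES, 1, FRONTIER_LOW_USD_PER_10K);
const frontierOnlyHigh = annualCostUsd(PAGES, 1, FRONTIER_HIGH_USD_PER_10K);
// Hypothetical routing mix: 15% of pages need the frontier model.
const routedLow = annualCostUsd(PAGES, 0.15, FRONTIER_LOW_USD_PER_10K);
const routedHigh = annualCostUsd(PAGES, 0.15, FRONTIER_HIGH_USD_PER_10K);
```

    Under these assumed prices, a million pages costs $200 on the cheap tier alone, $5,000 to $10,000 on a frontier model alone, and $920 to $1,670 when 15% of pages are routed to the frontier tier.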

    Latency. A vision LLM call takes seconds. A classical OCR call takes milliseconds. For real-time search over a title package, that gap is the user experience.

    Deterministic output. A vision LLM given the same image twice returns substantially the same JSON, but not necessarily a byte-for-byte identical copy. The worst failure mode is silent fabrication: a date the model invents because the field looked smudged. The lawyer signing has to defend every number, and a deterministic text layer gives the reviewer a stable source to check against.
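
    One way to surface that instability is to run the same page twice and compare the results field by field. The sketch below assumes each run has already been flattened into a map of field names to string values; the `Extraction` type and the `unstableFields` helper are hypothetical names, not part of any vendor API.

```typescript
// A flattened extraction: field name -> value, with null meaning
// "not visible on the page". This shape is an assumption for illustration.
type FieldValue = string | null;
type Extraction = Record<string, FieldValue>;

// Returns the names of fields whose values differ between two runs on the
// same page. A field missing from one run is treated the same as null.
function unstableFields(runA: Extraction, runB: Extraction): string[] {
  const keys = Array.from(
    new Set(Object.keys(runA).concat(Object.keys(runB))),
  );
  const differing: string[] = [];
  for (const key of keys) {
    const a = runA[key] ?? null;
    const b = runB[key] ?? null;
    if (a !== b) {
      differing.push(key);
    }
  }
  return differing.sort();
}
```

    Any field this returns is a candidate for reviewer attention, since a value that changes between runs is one the model is not reading reliably.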

    The right architecture in 2026 routes pages to the model that fits the job.

    The Cleardeal Pipeline in May 2026

    Cleardeal is a multi-tenant title-review SaaS for Quebec legal teams, live at cleardeal.ca. It pulls review requests from a firm's Microsoft 365 inbox, runs document understanding on the title package, and writes a draft opinion letter back as a DOCX through Microsoft Graph. The stack is Vite, React, TypeScript, Supabase, and Deno edge functions, with OpenAI Vision on the document layer.

    The pipeline is hybrid by design. A first-pass classical OCR builds a deterministic searchable text layer for the review interface. The same pages then go to OpenAI Vision, which reads each registration as a unit and returns structured records: registration number, date, registered right, parties, affected immovable, confidence score. The reviewer sees the model output and the source crop side by side, and signs only after reconciling both.
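
    As a rough illustration of what such a structured record might look like, here is a minimal sketch. The field names mirror the list above, but the exact shape, the ISO date convention, the 0-to-1 confidence range, and the `parseRecord` validator are all assumptions rather than Cleardeal's actual schema. Fields the model could not read are carried as explicit nulls, so a missing value is never confused with an invented one.

```typescript
// Hypothetical record shape; null means "not visible on the source page".
interface RegistrationRecord {
  registrationNumber: string | null;
  date: string | null; // assumed convention: ISO 8601 "YYYY-MM-DD"
  registeredRight: string | null;
  parties: string[];
  affectedImmovable: string | null;
  confidence: number; // assumed range: 0 to 1
}

function asNullableString(value: unknown, field: string): string | null {
  if (value === null) return null;
  if (typeof value === "string" && value.trim() !== "") return value;
  // Missing keys and empty strings are rejected: the model must say null.
  throw new Error(field + ": expected a non-empty string or null");
}

// Validates untrusted model output before it reaches the reviewer.
function parseRecord(raw: unknown): RegistrationRecord {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("record must be an object");
  }
  const obj = raw as Record<string, unknown>;

  const date = asNullableString(obj["date"], "date");
  // Checks the format only, not whether the calendar date exists.
  if (date !== null && !/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    throw new Error("date: expected YYYY-MM-DD or null");
  }

  const parties = obj["parties"];
  if (!Array.isArray(parties) || !parties.every((p) => typeof p === "string")) {
    throw new Error("parties: expected an array of strings");
  }

  const confidence = obj["confidence"];
  if (
    typeof confidence !== "number" ||
    !Number.isFinite(confidence) ||
    confidence < 0 ||
    confidence > 1
  ) {
    throw new Error("confidence: expected a number between 0 and 1");
  }

  return {
    registrationNumber: asNullableString(obj["registrationNumber"], "registrationNumber"),
    date,
    registeredRight: asNullableString(obj["registeredRight"], "registeredRight"),
    parties: parties as string[],
    affectedImmovable: asNullableString(obj["affectedImmovable"], "affectedImmovable"),
    confidence,
  };
}
```

    Rejecting malformed output at this boundary means the reviewer only ever reconciles records that are either well formed or explicitly blank.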

    The frontier vision wave changes the routing logic. The single Vision call that runs on every page today can be split. Clean modern registry pages go through a cheap fast model. Older handwritten entries, ambiguous stamps, and low-confidence pages go through a frontier model with grounding requirements. A second model checks anything the first returned with low confidence. Cost per file drops because most pages do not need the frontier model, and accuracy on the hard pages rises because that model no longer averages its attention across forty pages of boilerplate.
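
    The routing described above could be sketched along these lines. The page signals, the 0.85 threshold, and the function names are hypothetical; in practice the signals would have to come from some earlier step, such as the classical OCR pass or a lightweight classifier, and the threshold would need tuning on real files.

```typescript
type ModelTier = "fast" | "frontier";

// Hypothetical per-page signals produced by an earlier, cheap pass.
interface PageSignals {
  pageNumber: number;
  handwritten: boolean;
  ambiguousStamp: boolean;
  firstPassConfidence: number; // assumed range: 0 to 1
}

// Assumed threshold for illustration; tune against scored samples.
const CONFIDENCE_FLOOR = 0.85;

// Hard pages go to the frontier model, everything else to the fast one.
function routePage(page: PageSignals): ModelTier {
  if (page.handwritten || page.ambiguousStamp) return "frontier";
  if (page.firstPassConfidence < CONFIDENCE_FLOOR) return "frontier";
  return "fast";
}

// Flags an extraction for the second-model check when its own reported
// confidence falls below the same floor.
function needsSecondCheck(extractionConfidence: number): boolean {
  return extractionConfidence < CONFIDENCE_FLOOR;
}

// Counts how many pages of a package land in each tier.
function tierCounts(pages: PageSignals[]): Record<ModelTier, number> {
  const counts: Record<ModelTier, number> = { fast: 0, frontier: 0 };
  for (const page of pages) {
    counts[routePage(page)] += 1;
  }
  return counts;
}
```

    Keeping the rules this explicit has a side benefit: the reason a page went to the expensive model can be logged alongside the result.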

    Industry surveys put AI adoption for title commitment review at roughly 47% in 2026, with 85% to 92% accuracy on exception categorization. The model handles triage and drafting; the opinion stays with the signing notary or lawyer.

    What to Test Before You Rip Out the Old Stack

    Any firm moving its pipeline off classical OCR and onto a vision LLM should run four tests on its own files first.

    1. Accuracy on real registers. Sample fifty to a hundred recent title packages. Score every field against the source page, not against a previous extraction. Vendor demos run on clean files. Yours are not clean.
    2. Hallucination guards. A blank field is recoverable; a fabricated date is a malpractice complaint. The model has to say "not visible" when the source is not visible, and the prompt has to enforce that.
    3. Confidence scoring that means something. A confidence score uncorrelated with accuracy is worse than no score. Validate that low confidence predicts low accuracy on your data.
    4. Audit trail. Quebec professional rules require that a notary or lawyer defend every output. The pipeline has to store the source crop, model version, prompt, timestamp, and reviewer sign-off.
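
    The confidence check in the third test can be sketched as a simple bucketing exercise: group hand-scored fields by the confidence the model reported, then compare accuracy across buckets. The `ScoredField` shape, the default bucket count, and both function names are assumptions for illustration.

```typescript
// One extracted field, scored by hand against the source page.
interface ScoredField {
  confidence: number; // model-reported, assumed range 0 to 1
  correct: boolean; // reviewer's verdict
}

interface Bucket {
  lower: number;
  upper: number;
  count: number;
  accuracy: number;
}

// Groups fields into equal-width confidence buckets and reports the
// observed accuracy in each. Empty buckets are omitted.
function accuracyByConfidence(fields: ScoredField[], bucketCount = 5): Bucket[] {
  const totals = new Array<number>(bucketCount).fill(0);
  const hits = new Array<number>(bucketCount).fill(0);
  for (const field of fields) {
    const raw = Math.floor(field.confidence * bucketCount);
    const index = Math.max(0, Math.min(bucketCount - 1, raw));
    totals[index] = (totals[index] ?? 0) + 1;
    if (field.correct) {
      hits[index] = (hits[index] ?? 0) + 1;
    }
  }
  const buckets: Bucket[] = [];
  for (let i = 0; i < bucketCount; i++) {
    const count = totals[i] ?? 0;
    if (count === 0) continue;
    buckets.push({
      lower: i / bucketCount,
      upper: (i + 1) / bucketCount,
      count,
      accuracy: (hits[i] ?? 0) / count,
    });
  }
  return buckets;
}

// Crude proxy for a meaningful score: accuracy never drops as the
// confidence bucket rises.
function accuracyRisesWithConfidence(buckets: Bucket[]): boolean {
  let previous = -Infinity;
  for (const bucket of buckets) {
    if (bucket.accuracy < previous) return false;
    previous = bucket.accuracy;
  }
  return true;
}
```

    A monotonic check like this is only a first pass. With small samples a single bucket can dip by chance, so the bucket counts matter as much as the accuracies.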

    Cost, latency, and auditability decide the architecture more often than the headline accuracy number. The product that wins in 2026 knows when to call the expensive model and when not to.