Why is pulling financial data out of documents so hard?
Extracting financial information should, in theory, be an easy task for AI. After all, financial data is structured, number-based, and often clearly labelled for the human reader. In practice, however, OCR tools struggle with the work that finance users actually need done.
Surprisingly, the earliest forms of optical character recognition (OCR) were designed for banks. Business, military, or scientific use provided the financial justification for many of AI’s early applications, and OCR was no exception: the rising volume of bank checks after WWII led Bank of America to collaborate with the Stanford Research Institute on ERMA, a system to automate bookkeeping and proofing. ERMA allowed banks to manage the flood of paper checks amid staff shortages, and ultimately paved the way for the electronic banking we have today. (For OCR history buffs, also notable are Ray Kurzweil’s “omni-font” system, which he sold to Xerox, Oliver Selfridge’s Pandemonium, and the idea of “pooling layers” put forward by Yann LeCun.)
Financial documents come in a dizzying array of formats, numbering conventions, and styles. Invoices, payroll slips, receipts, and tax documents all contain financial data, but differ greatly in how that data is organized. These differences compound across large datasets – which makes OCR hard to rely on.
Shortcomings of OCR
Despite its early success in banking, OCR still comes with well-known challenges, familiar to anyone who works with financial documents and other unstructured PDFs. A smattering of errors or hallucinated data is to be expected: OCR can invent a street name like “Lamefront” instead of “Lanefront” if it misreads the curve of a poorly scanned “n” as an “m.”
Irregular results may be tolerable in a single document, but they become a serious problem across large financial datasets. To guarantee accuracy with OCR, you might end up reviewing the entire dataset by hand – eliminating the time-saving benefits entirely.
Documents like PDFs are unstructured by design: they present information in a way a human reader finds easy to follow, often as images or tables. These elements carry no metadata – a bar chart is easy for a human to interpret, but far less intuitive for AI. OCR tools struggle to reconstruct the structure of these images and tables in spreadsheets or other formats, and may misalign columns or misinterpret row breaks, as the sketch below illustrates.
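To make the misalignment problem concrete, here is a minimal sketch of what goes wrong when OCR output is rebuilt by naively splitting on whitespace. The invoice lines and column names are hypothetical; the point is that a single blank cell silently shifts every value after it into the wrong column.

```python
# Hypothetical OCR output for a two-row invoice table.
ocr_lines = [
    "Widget      2      10.00      20.00",
    "Shipping           5.00        5.00",  # blank Qty cell
]

columns = ["item", "qty", "unit_price", "amount"]

for line in ocr_lines:
    # Naive reconstruction: split on whitespace. The blank cell
    # simply disappears, so later values shift left one column.
    cells = line.split()
    print(dict(zip(columns, cells)))

# {'item': 'Widget', 'qty': '2', 'unit_price': '10.00', 'amount': '20.00'}
# {'item': 'Shipping', 'qty': '5.00', 'unit_price': '5.00'}
#  -> "Shipping" now appears to have a quantity of 5.00 and no amount.
```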
OCR also does not understand the meaning of the data it pulls. In many financial documents, for example, an accountant will recognize that the double-underlined number at the bottom of a column is that column’s sum total. Train an OCR model on that convention, and it will assume it holds even in cases where it doesn’t.
Many OCR errors come down to probability: OCR systems predict labels and relationships statistically rather than by “understanding” the data. Without additional training or contextual rules, the model cannot recognize that “tax” or “subtotal” is contextually different from “sum total.”
So when the system sees a number at the bottom of a column without a clear label (or with a label it hasn’t seen before), it falls back on statistical likelihood. If 90% of columns end in a double underline, and the model has been taught that a double underline means sum total, it will get the label right (hopefully) around 90% of the time. But hand the model a new document from a real estate agent who uses double underlines for signatures and dates, and it will now treat each date in the signature column as a sum total. The signatory, Bobby, writes in all caps and a shaky hand, so the OCR gives him a new name: 8088Y.
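Here is a toy sketch of that failure mode. The rule, documents, and values are all hypothetical; the point is that a statistically learned association gets applied regardless of what the text actually means.

```python
# Toy sketch: a "model" that has learned one statistical rule from
# invoices -- double-underlined values are sum totals.

def label_value(text: str, double_underlined: bool) -> str:
    """Assign a label from the learned association, not from meaning."""
    if double_underlined:
        return "sum_total"  # right ~90% of the time on invoices
    return "line_item"

# Invoice column: the rule works as intended.
print(label_value("1,240.00", double_underlined=True))    # sum_total

# Real-estate contract: signatures and dates are also double-underlined.
print(label_value("04/12/2024", double_underlined=True))  # sum_total (wrong)
print(label_value("8088Y", double_underlined=True))       # sum_total (wrong)
# Poor Bobby: his shaky all-caps signature is now a "sum total" too.
```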
Moving Beyond OCR
While OCR has its place in digitizing text, extracting financial data clearly requires more than static character recognition. Modern solutions, like the ones underlying Discrepancy, combine machine learning, natural language processing (NLP), and advanced algorithms to address these challenges – so users can finally access their unstructured data in real time.